Abstract

There  has  been  recent  interest  in  using  a  class  of  incremental  learning  algorithms  called  temporal 
difference  learning  methods  to  attack  problems  of  prediction.  These  algorithms  have  been 
brought  to  bear  on  various  prediction  problems  in  the  past,  but  have  remained  poorly  understood. 
It is the purpose of this thesis to further explore this class of algorithms, particularly the TD(λ)
algorithm.  A  number  of  practical  issues  are  raised  and  discussed  from  a  general  theoretical 
perspective  and  then  explored  in  the  context  of  several  case  studies.  The  thesis  presents  a 
framework  for  viewing  these  algorithms  independent  of  the  particular  task  at  hand  and  uses  this 
framework  to  explore  not  only  tasks  of  prediction,  but  also  prediction  tasks  that  require  control, 
whether complete or partial. This includes applying the TD(λ) algorithm to two tasks: 1) learning
to play tic-tac-toe from the outcome of self-play and the outcome of play against a perfectly-playing
opponent and 2) learning two simple one-dimensional image segmentation tasks.
Chapter 1


1  Learning  and  Neural  Networks 


Learning is a question central to the study of Artificial Intelligence. The subfields
that make up AI are varied and sometimes quite divergent in their immediate
goals  and  methods;  however,  the  goal  of  “learning”  in  a  way  that  mimics— and 
hopefully  illuminates— the  process  employed  by  human  beings  is  an  omnipresent 
one. 

There  has  been  great  interest  in  the  study  of  neural  networks  as  a  method  for 
attacking  this  problem  of  learning.  This  interest  has  led  to  the  creation  of  many 
different structures which have been dubbed “networks.” In this thesis, I will use
the  General  Radial  Basis  Function  (GRBF)  and  HyperBF  networks  (Poggio  and 
Girosi,  1990),  to  discuss  algorithms  for  training  neural  networks.  In  particular,  I 
will discuss and propose evaluation criteria for the TD(λ) temporal difference
learning algorithm (Sutton, 1988). As a training algorithm it is provably equivalent
to the more widely-used supervised learning algorithms; however, questions
remain about its usefulness and efficiency with more complex real-world
problems.  In  this  thesis  I  will  identify  a  number  of  practical  issues  that  this 
algorithm  must  address  and  use  several  case  studies  to  provide  an  empirical 
context  to  study  its  strengths  and  limitations. 

This  thesis  is  divided  into  several  parts.  This  first  chapter  introduces  feedforward 
networks  in  some  detail  and  broadly  defines  two  main  classes  of  algorithms  for 
training them. The second chapter introduces and derives another class of
training algorithms, the TD(λ) algorithms, and distinguishes them from the two
aforementioned classes as a temporal difference learning method. Theoretical
work is presented, relating TD(λ) to currently studied problems and to the
prediction paradigm for which the algorithm should be ideal. In addition, this
chapter develops a theoretical and algorithmic formalism for studying TD(λ),
allowing one to encompass not only tasks of simple prediction, but also more
complicated prediction tasks that involve control, whether complete or partial.
Chapter three presents related work, describes many of the practical
issues that the TD(λ) method raises and attempts to show where this research
fits into the greater body of work. Chapters four and five describe the case
studies used in this research to evaluate the TD(λ) algorithm, relating them to
the practical issues discussed in the previous chapter. In particular, the case
studies explore two problems involving prediction and control: determining an
evaluation function for tic-tac-toe positions through both self-play and play
against a superior opponent, and simulating a restricted class of recurrent
networks to learn to do two kinds of simple segmentation. Results are presented
and discussed. Finally, chapter six concludes with a brief discussion of TD(λ)
and a review of the thesis.
1.1 Why networks at all?


In  practice,  neural  nets  are  difficult  to  train  and  often  have  trouble  performing 
even  what  would  seem  to  be  simple  tasks.  As  number-crunching  mechanisms, 
networks  are  often  unable  to  deal  easily  with  the  processing  of  symbols;  are 
extremely  sensitive  to  the  representation  used  for  the  data;  and  oftentimes 
require  an  inordinate  amount  of  time  to  train.  Still,  neural  networks  do  enjoy 
several  advantages: 

1. They can learn to perform tasks for which computational algorithms do not
exist or are poorly understood.

2. They learn on the fly, adapting their behavior to a changing environment.

3.  As  mathematical  abstractions,  they  are  not  wedded  to  any  specific 
algorithmic  engine  for  training. 

4.  They  inherit  a  wealth  of  theory  and  empirical  data  from  approximation 
theory,  particularly  from  the  fields  of  regression  and  statistical  inference. 

In  practice,  the  first  property  is  probably  the  most  important.  For  many  problems 
of  interest,  the  level  of  understanding  of  the  problem  is  poor.  For  example,  with  a 
computer  vision  problem,  we  might  want  to  perform  some  sort  of  object 
recognition  but  are  unable  to  actually  define  what  we  mean  by  the  “object.” 
Despite  this  lack  of  a  specification,  we  can  usually  describe  how  a  correct 
algorithm  should  perform  on  particular  examples.  In  this  case,  we  can  define 
learning as a process of associating particular inputs to particular outputs, that is,
as function approximation. This allows our analysis of neural networks to draw
upon  approximation  theory,  inheriting  information  from  the  statistical  and 
regression  communities. 

If  a  self-training  network  is  capable  of  discovering  a  function  that  performs 
correctly  on  a  set  of  examples,  it  might  be  able  to  generalize  to  solve  the  problem 
for  inputs  that  it  has  yet  to  see.  If  the  network  does  well  in  generalizing  the 
problem,  this  is  an  indication  that  it  has  discovered  some  important  underlying 
structure. If this is the case, subsequent analysis of the network’s “answer” might
contribute  to  our  understanding  of  the  original  problem.  Unfortunately,  this 
generalization  problem  is  ill-posed:  any  finite  set  of  examples  for  a  function  is 
consistent  with  an  infinite  number  of  functions,  many  of  which  may  have  nothing 
to  do  with  the  original  problem. 

The  second  property  of  networks — their  built-in  adaptability — is  also  important.  If 
a neural network’s behavior is learned in the first place, then re-learning based
on  some  change  in  the  environment  should  be  easy  to  implement.  This  is,  of 
course,  a  property  found  in  all  high-level  biological  organisms  and  one  computer 
scientists  seek  to  emulate. 

As  mathematical  abstractions,  neural  network  structures  should  be  independent 
of  changes  in  a  particular  training  algorithm.  The  purpose  of  a  training  algorithm 
is  to  adjust  the  parameters  of  the  network  to  move  its  outputs  closer  to  the 
desired  outputs.  Of  course,  some  training  algorithms  perform  better  than  others 
and  some  algorithms  might  learn  more  quickly  with  certain  networks  than  with 
others. Still, the choice of a training algorithm should not change the inherent
power of the network itself. This allows researchers to experiment with various
algorithmic engines without affecting the universality of whatever neural network
they choose to use for a base.

Figure 1.1: A perceptron-like feedforward network with, from bottom to top, an
input layer, hidden layers and an output layer. Input values are “clamped” at
the input layer. A unit at a higher level receives a weighted sum of the values of units
below as input. It then passes this input through a function, usually non-linear, to
produce its own value.


1.2  Radial  Basis  Function  Networks 

Neural  networks  consist  of  a  set  of  interconnected  computational  units.  The 
connections  are  directed  and  usually  weighted.  For  most  kinds  of  networks,  it  is 
these  weights  that  training  algorithms  adjust. 
One of the most common types of networks is the feedforward network. A
feedforward  network  is  any  network  that  can  be  divided  into  distinct  layers,  such 
that  there  are  no  connections  from  a  unit  in  an  upper  layer  to  any  unit  in  a  lower 
layer.  The  lowest  layer  consists  of  input  units  onto  which  input  values  are 
clamped.  These  values  are  then  passed  through  a  set  of  weights  to  produce  the 
inputs to the next layer of so-called “hidden” units. These units take their inputs
and modify them using some function, passing their values to the next layer.
This  process  is  continued,  with  the  output  of  the  network  being  the  outputs  of  the 
final  layer  of  units.  A  typical  feedforward  network  is  shown  in  figure  1.1.  For 
simplicity,  we  have  assumed  that  the  function  at  each  unit  is  the  same. 

If $x_j$ denotes the output of unit $j$ and $w_{ij}$ denotes the weight on the connection
from unit $i$ to $j$ (where $w_{ij}$ can be zero), we can express the output of unit $j$
simply:

$$x_j = f\Big(\sum_i w_{ij}\, x_i\Big). \tag{1.1}$$


The functions at each unit do not have to be nonlinear but they usually are. In
sigmoidal networks, for example, each unit employs a sigmoidal function, such as
the logistic function ($f(x) = \frac{1}{1 + e^{-x}}$).
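
As a rough illustration of equation (1.1), the following Python sketch propagates values through a small feedforward network of logistic units. The function names, the list-of-rows weight layout and the example weights are assumptions made for this sketch; they do not come from the thesis.

```python
import math

def logistic(x):
    """Logistic (sigmoidal) function f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, f=logistic):
    """Apply equation (1.1) to every unit of one layer.

    weights[j][i] is the weight on the connection from unit i in the
    layer below to unit j in this layer (zero means "no connection").
    """
    return [f(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def feedforward(inputs, layers):
    """Clamp the inputs, then pass values upward one layer at a time."""
    values = inputs
    for weights in layers:
        values = layer_output(values, weights)
    return values

# Hypothetical network: 2 inputs, 2 hidden units, 1 output unit.
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
output_weights = [[1.0, -1.0]]
print(feedforward([0.0, 0.0], [hidden_weights, output_weights]))
```

With both inputs at zero, each hidden unit receives a net input of zero and so takes the value 0.5; the two hidden values then cancel at the output unit.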

One  particular  kind  of  feedforward  network  is  the  radial  basis  function  network 
(Broomhead  and  Lowe,  1988;  Poggio  and  Girosi,  1989).  This  type  of  network 
always  contains  three  layers:  a  layer  of  input  units,  a  hidden  layer  of  radial  basis 
function  (RBF)  units  and  a  layer  of  output  units.  Each  of  the  RBF  units  has  a 
vector of parameters, $t_i$, called a center and is connected to the output units by a
weighted coefficient vector, $c_i$. We can express the value of an output unit, $j$,
as:

$$y_j = \sum_i c_{ij}\, G(\|x - t_i\|). \tag{1.2}$$

$G(\cdot)$ is a radial basis function, usually a gaussian ($G(x) = e^{-x^2}$) or multiquadric
($G(x) = \sqrt{\gamma^2 + x^2}$), and $\|x\|$ represents the $L_2$ norm.

Figure 1.2: An RBF network. All RBF networks have three layers: inputs, centers
and outputs.

In  an  RBF  network,  the  number  of  RBF  units  is  equal  to  the  number  of  training 
examples with each center, $t_i$, set equal to one of the training examples. Only
the coefficients, $c_{ij}$, must be learned in this case. This reduces the process of
learning  to  a  simple  linear  problem  solvable  by  matrix  inversion  (Broomhead  and 
Lowe,  1988). 
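
A minimal sketch of this linear problem, assuming a single output unit and the gaussian $G(x) = e^{-x^2}$. The helper names and the toy data are hypothetical, and the system is solved by Gaussian elimination rather than by forming an explicit matrix inverse, a substitution made here for convenience.

```python
import math

def gaussian(r):
    """Gaussian radial basis function G(r) = e^(-r^2)."""
    return math.exp(-r * r)

def distance(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def solve(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (M[i][n] - s) / M[i][i]
    return c

def fit_rbf(examples, targets):
    """One center per training example; only the coefficients are learned."""
    G = [[gaussian(distance(x, t)) for t in examples] for x in examples]
    return solve(G, targets)

def rbf_output(x, centers, coeffs):
    """Output of the single output unit: sum_i c_i G(|x - t_i|)."""
    return sum(c * gaussian(distance(x, t)) for c, t in zip(coeffs, centers))

# Hypothetical one-dimensional training set.
examples = [[0.0], [1.0], [2.0]]
targets = [0.0, 1.0, 0.0]
coeffs = fit_rbf(examples, targets)
print([rbf_output(x, examples, coeffs) for x in examples])
```

Because there are as many centers as examples, the fitted network reproduces every training target up to rounding error.
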
The RBF network can be generalized by allowing fewer centers than training
examples.  The  centers  of  this  generalized  radial  basis  function  network  (GRBF) 
are  usually  initialized  to  some  subset  of  the  training  examples.  The  centers  can 
then  remain  fixed  or  be  allowed  to  change.  In  general,  if  the  centers  remain 
fixed, the system of linear equations is overconstrained; however, the pseudoinverse
can be used to find a mapping with the smallest possible error on the
training  examples  (Poggio  and  Girosi,  1989). 

Poggio  and  Girosi  (1990)  have  proposed  a  further  generalization:  weighting  the 
connections  between  the  input  units  and  the  RBF  units,  effectively  replacing  the 
$L_2$ norm with a weighted norm:

$$y_j = \sum_{i=1}^{n} c_{ij}\, G(\|x - t_i\|_W) \tag{1.3}$$

where

$$\|x - t_i\|_W^2 = (x - t_i)^T W^T W (x - t_i). \tag{1.4}$$

If  the  weighting  matrix,  W,  is  diagonal  (i.e.  may  only  have  non-zero  values  along 
its  diagonal),  then  for  some  simple  tasks  it  is  possible  to  interpret  each  diagonal 
component  of  W  as  indicating  the  importance  or  contribution  of  the 
corresponding component of the input vectors. A key component, $x_i$, is
exaggerated by a large value for $w_{ii}$ while an unimportant component, $x_j$, is
minimized by a small $w_{jj}$.
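
The effect of a diagonal $W$ can be seen in a small sketch of the weighted norm of equation (1.4). The function name and the example matrix are assumptions chosen for illustration.

```python
def weighted_sq_norm(x, t, W):
    """Squared weighted norm (x - t)^T W^T W (x - t), computed as |W (x - t)|^2."""
    d = [xi - ti for xi, ti in zip(x, t)]
    Wd = [sum(w * dk for w, dk in zip(row, d)) for row in W]
    return sum(v * v for v in Wd)

# Hypothetical diagonal weighting: component 0 matters, component 1 barely does.
W = [[2.0, 0.0],
     [0.0, 0.1]]
center = [0.0, 0.0]

# The same unit step away from the center, taken along each component in turn.
print(weighted_sq_norm([1.0, 0.0], center, W))
print(weighted_sq_norm([0.0, 1.0], center, W))
```

A unit step along the heavily weighted component moves the input much farther from the center, in the weighted sense, than the same step along the lightly weighted one.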

This kind of RBF network, referred to as a HyperBF network, is derived from
regularization  theory.  By  imposing  smoothness  constraints,  the  ill-posed  problem 
of  generalizing  a  function  from  input-output  example  pairs  is  changed  into  a  well- 
posed  one.  Like  both  the  RBF  and  GRBF  networks,  the  HyperBF  network  can 
approximate any continuous function arbitrarily well on a compact, finite set
(Poggio  and  Girosi,  1989). 


For the case studies in this thesis, we have chosen to use a GRBF network using
the gaussian as the radial basis function:

$$G(\|x - t_i\|) = e^{-\|x - t_i\|^2 / \sigma_i^2} \tag{1.5}$$

with a distinct, adjustable $\sigma_i$ for each adjustable center. Results should not be
limited to just this type of network.


1.3  Supervised  versus  Unsupervised  Learning 

Whatever  the  kind  of  network  used,  training  algorithms  have  been  traditionally 
divided into two major categories: supervised and unsupervised (Hinton, 1987;
Lippman,  1987).  Generally  speaking,  a  supervised  learning  algorithm  is  any 
algorithm  that  involves  a  knowledgeable  teacher  who  provides  the  correct  answer 
for  every  input  example  presented  to  the  network.  With  an  unsupervised  learning 
algorithm  there  is  no  teacher  and  the  network  is  left  to  discover  some  useful 
structure  on  its  own.  Of  course  there  is  an  implicit  mapping  that  the  network 
must  learn  in  any  unsupervised  algorithm  and,  therefore,  an  implicit  teacher.  In 
fact,  it  is  sometimes  possible  to  simulate  one  type  of  algorithm  with  an  algorithm 
of  the  other  type. 

As  such,  it  may  be  most  practical  to  describe  the  differences  between  supervised 
and  unsupervised  algorithms  as  differences  in  goals,  as  opposed  to  technique: 
the object of supervised learning is to approximate a particular input-output
mapping  while  the  object  of  unsupervised  learning  is  to  find  a  mapping  which 
possesses  some  specific  underlying  properties  that  have  been  deemed 
important.  Both  methods  have  their  strengths  and  weaknesses. 

Supervised  learning  algorithms  have  met  with  considerable  success  in  solving 
some difficult tasks, faring somewhat better in this regard than unsupervised
learning  algorithms.  This  makes  some  sense.  While  it  is  not  too  difficult  to 
imagine  that  there  might  exist  a  (complex)  function  that  maps,  say,  bit-image 
representations  of  hand-drawn  digits  to  the  numbers  they  represent,  it  seems  a 
bit harder to imagine an important “underlying principle” that would straightforwardly
accomplish the same.

On  the  other  hand,  supervised  learning  algorithms  have  been  limited  by  their 
poor  scaling  behavior  and  tend  to  produce  problem-specific  representations  that 
do  not  carry  over  well  to  new  tasks.  Unsupervised  learning  algorithms  seem  to 
do  better  in  this  regard.  Further,  they  appeal  to  the  goal  of  emulating  the  human 
learning  process,  which  at  least  seems  to  be  unsupervised. 

Of  course,  these  two  categories  do  not  exhaust  the  possibilities.  Clearly  there  is 
a  continuum  of  learning  types  between  these  two  extremes.  For  example,  we 
could  combine  some  sort  of  “underlying  principle”  that  generalizes  well  to  many 
problems  with  the  power  of  an  external  teacher.  In  this  way  we  can  guide  a 
network  to  a  final  solution  which  not  only  performs  complex  tasks,  but  chooses 
functions  that  capture  important  underlying  structures. 
One class of algorithms that approaches the problem of learning in a way that is
different from both unsupervised and supervised learning algorithms is the class
of  temporal  difference  learning  methods.  Instead  of  changing  network 
parameters  by  means  of  the  difference  between  predicted  and  actual  outputs, 
these  methods  update  parameter  values  by  means  of  the  difference  between 
temporally  successive  predictions  (Sutton,  1988).  Feedback  is  usually  provided 
by  a  teacher  at  the  end  of  a  series  of  predictions.  This  combines  the  principle  of 
temporal  (and  spatial)  coherence — the  notion  that  the  environment  is  stable  and 
smooth  and  any  function  that  predicts  behavior  should  reflect  that  notion — with 
the  power  of  an  external  teacher. 

In the following chapters, we will explore the TD(λ) algorithm, a member of this
class,  evaluating  its  usefulness  with  a  number  of  test  cases  and  exploring 
whether  real-world  problems  can  be  better  thought  of  as  prediction  problems  and, 
perhaps,  better  attacked  by  this  class  of  learning  algorithms. 




Chapter  2 


2  Temporal  Difference  Learning 


In  this  chapter  we  discuss  a  class  of  learning  algorithms  called  temporal 
difference  learning  (TD)  methods.  This  is  a  class  of  incremental  learning 
procedures  specialized  for  prediction  problems.  As  noted  earlier,  more  traditional 
learning  procedures  update  parameters  by  means  of  the  error  between  the 
neural  net’s  predicted  or  proposed  output  and  the  actual  or  desired  output.  TD 
learning  methods  are  driven  instead  by  the  error  between  temporally  successive 
predictions.  In  this  way  learning  actually  occurs  whenever  there  is  a  change  in  a 
prediction  over  time. 

The  earliest  use  of  a  TD  method  was  Samuel’s  (1959)  checker-playing  program. 
For  each  pair  of  successive  game  positions,  the  program  would  use  the 
difference  between  the  evaluations  of  the  two  positions  to  modify  the  earlier 
position’s  evaluation.  Similar  methods  have  been  used  in  Holland’s  (1986) 
bucket  brigade,  Sutton’s  (1984)  Adaptive  Heuristic  Critic  and  Tesauro’s  (1991) 
Backgammon program. Unfortunately, TD algorithms have remained poorly
understood. Sutton (1988) has provided a theoretical foundation for their use,
proving convergence and optimality for special cases; Dayan (1991) has
extended Sutton’s proofs and Tesauro (1991) has provided an empirical study of
the superiority of TD algorithms in at least one domain; however, it is still unclear
how  well  these  algorithms  can  perform  in  general  with  complex,  real-world 
domains  or  with  structures  other  than  linear  and  sigmoidal  networks.  To  explore 
these  issues,  we  will  first  discuss  in  some  detail  the  different  approaches  used  by 
temporal  difference  and  more  traditional  learning  methods  to  solve  prediction 
problems.  Then,  we  will  explore  the  difference  between  problems  of  simple 
prediction  and  problems  of  both  prediction  and  control,  proposing  a  general 
framework  for  discussing  both. 


2.1  Temporal  Difference  versus  Traditional  Approaches  to  Prediction 

Suppose  that  we  attempt  to  predict  on  each  day  of  the  week  whether  it  will  rain 
the  following  Monday.  A  traditional,  supervised,  approach  would  compare  the 
prediction  of  each  day  to  the  actual  outcome,  while  a  TD  approach  would 
compare  each  day’s  prediction  to  the  following  day’s  prediction.  Finally,  the 
network’s  last  prediction  would  be  compared  to  the  actual  outcome.  This  forces 
two  constraints  upon  the  neural  net:  1)  it  must  learn  a  prediction  function  that  is 
consistent  or  smooth  from  day-to-day  and  2)  that  function  must  eventually  agree 
with  the  actual  outcome.  The  first  is  accomplished  by  forcing  each  prediction  to 
be  similar  to  the  prediction  following  it,  while  the  second  is  accomplished  by 
forcing  the  last  prediction  to  be  consistent  with  the  actual  outcome.  The  correct 
answer  is  propagated  from  the  final  prediction  to  the  first. 
This approach assumes that the state of the environment is somewhat
continuous  and  does  not  radically  change  from  one  point  in  time  to  the  next.  In 
other  words,  the  environment  is  predictable  and  stable.  If  we  accept  this 
assumption,  the  TD  approach  has  three  immediate  advantages: 

1.  It  is  incremental  and,  presumably,  easier  to  compute. 

2.  It  is  able  to  make  better  use  of  its  experience. 

3.  It  is  closer  to  the  actual  learning  behavior  of  humans. 

The  first  point  is  a  practical  as  well  as  theoretical  one.  In  the  weather  prediction 
example,  the  TD  algorithms  can  update  each  day’s  prediction  on  the  following 
day  while  traditional  algorithms  would  wait  until  Monday  and  make  all  the 
changes  at  once.  These  algorithms  would  have  to  do  more  computing  at  once 
and  require  more  storage  during  the  week.  This  is  an  important  consideration  in 
more  complex  and  data-intensive  tasks. 

The  second  and  third  advantages  are  related  to  the  notion  of  single-step  versus 
multi-step  problems.  Any  prediction  problem  can  be  cast  into  the  supervised- 
learning  paradigm  by  forming  input-output  pairs  made  up  of  the  data  upon  which 
the  prediction  is  to  be  made  and  the  final  outcome.  For  the  weather  example,  we 
could  form  a  pair  with  the  data  at  each  day  of  the  week  and  the  actual  outcome 
on  Monday.  This  pairwise  approach,  though  widely  used,  ignores  the  sequential 
nature  of  the  task.  It  makes  the  simplifying  assumption  that  its  tasks  are  single- 
step  problems:  all  information  about  the  correctness  of  each  prediction  is 
available all at once. On the other hand, a multi-step problem is one where the
correctness  of  a  prediction  is  not  available  for  several  steps  after  the  prediction  is 
made,  but  partial  information  about  a  prediction’s  correctness  is  revealed  at  each 
step.  The  weather  prediction  problem  is  a  multi-step  problem;  new  information 
becomes  available  on  each  day  that  is  relevant  to  the  previous  prediction.  A 
supervised-learning approach cannot take advantage of this new information in
an  incremental  way. 
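
The contrast between the two kinds of error signal can be sketched for the weather example. The daily predictions below are made-up numbers and the function names are hypothetical; the point is only which quantities each approach needs and when they become available.

```python
def supervised_errors(predictions, outcome):
    """Pairwise approach: every prediction is compared with the final outcome,
    so no error can be computed until the outcome is known."""
    return [outcome - p for p in predictions]

def td_errors(predictions, outcome):
    """TD approach: each prediction is compared with the one that follows it;
    only the last prediction is compared with the actual outcome."""
    successors = predictions[1:] + [outcome]
    return [nxt - p for p, nxt in zip(predictions, successors)]

# Hypothetical estimates, made on five successive days, that it rains on Monday.
predictions = [0.5, 0.6, 0.4, 0.7, 0.9]
outcome = 1.0  # it did rain

print(supervised_errors(predictions, outcome))
print(td_errors(predictions, outcome))
```

Each TD error except the last is available one day after its prediction is made, while every supervised error has to wait for Monday. The TD errors from a given day onward add up to that day's supervised error.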

This  is  a  serious  drawback.  Not  only  are  many,  perhaps  most,  real-world 
problems  actually  multi-step  problems,  but  it  is  clear  that  humans  use  a  multi- 
step  approach  to  learn.  In  the  course  of  moving  to  grasp  an  object,  for  example, 
humans  constantly  update  their  prediction  of  where  their  hands  will  come  to  rest. 
Even  in  simple  pattern-recognition  tasks,  such  as  speech  recognition — a 
traditional  domain  of  supervised  learning  methods — humans  are  not  faced  with 
simple  pattern-classification  pairs,  but  a  series  of  patterns  that  all  contribute  to 
the  same  classification. 


2.2 Derivation of the TD(λ) learning algorithm

In the following two subsections we derive the TD(λ) learning algorithm (Sutton,
1988). First we introduce a temporal difference learning procedure that is directly
derived from the classical general delta learning rule and induces the same
weight changes. With this basic learning procedure defined, we expand it to
encompass the much larger and more general TD(λ) learning algorithm which
produces weight changes that are different from those of any supervised-learning
algorithm. In the next section, we explore exactly what the TD(λ) procedures
compute and how this differs from the more traditional approaches.

2.2.1  The  General  Learning  Rule 

We consider the multi-step prediction problem to consist of a series of
observation-outcome sequences of the form $x_1, x_2, \ldots, x_m, z$. Each $x_t$ is a vector
representing an observation at time $t$ while $z$ is the actual outcome of the
sequence. Although $z$ is often assumed to be a real-valued scalar, $z$ is not
prevented from being a vector. For each observation in the sequence, $x_t$, the
network produces a corresponding output or prediction, $P_t$. These predictions
are estimates of $z$.

As noted in the first chapter, learning algorithms update adjustable parameters of
a network. We will refer to these parameters as the vector $w$. For each
observation, a change to the parameters, $\Delta w_t$, is determined. At the end of each
observation-outcome sequence, $w$ is changed by the sum of the observation
increments:

$$w \leftarrow w + \sum_{t=1}^{m} \Delta w_t, \tag{2.1}$$

where $m$ is the number of observations in the sequence.

This leaves us with the question of how to determine $\Delta w_t$. One way to treat the
problem is as a series of observation-outcome pairs, $(x_1, z), (x_2, z), \ldots, (x_m, z)$, and
use the backpropagation learning rule:

$$\Delta w_t = \alpha (z - P_t)\, \nabla_w P_t \tag{2.2}$$

where $\alpha$ is a positive value affecting the rate of learning; $\nabla_w P_t$ is the vector of
partial derivatives of $P_t$ with respect to $w$; and $(z - P_t)$ represents a measure of
the error or difference between the predicted outcome and the actual outcome.
This learning rule is a generalization of the delta or Widrow-Hoff rule (Rumelhart
et al., 1986).

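This is a clear supervised learning algorithm with each $\Delta w_t$ depending directly on
$z$. This falls prey to the disadvantages noted earlier. To convert this to a
temporal difference algorithm, we must represent the error $(z - P_t)$ in a different
way. We can use the “telescoping rule” to note that:

$$(z - P_t) = \sum_{k=t}^{m} (P_{k+1} - P_k) \tag{2.3}$$

if $P_{m+1} = z$. Using this, we can combine equations (2.1) and (2.2) to produce a
temporal difference update rule:

$$\begin{aligned}
w \leftarrow w + \sum_{t=1}^{m} \Delta w_t &= w + \sum_{t=1}^{m} \alpha (z - P_t)\, \nabla_w P_t \\
&= w + \sum_{t=1}^{m} \alpha \sum_{k=t}^{m} (P_{k+1} - P_k)\, \nabla_w P_t \\
&= w + \sum_{k=1}^{m} \alpha \sum_{t=1}^{k} (P_{k+1} - P_k)\, \nabla_w P_t \\
&= w + \sum_{t=1}^{m} \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k.
\end{aligned}$$

In other words:

$$\Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k. \tag{2.4}$$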
Note the incremental nature of this rule: each $\Delta w_t$ depends only on a pair of
successively-determined predictions and the sum of past values of $\nabla_w P_t$, which
can be accumulated with each observation.
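
The claim that rule (2.4) induces the same total weight change as the supervised rule (2.2) can be checked numerically. The sketch below assumes a linear predictor $P_t = w \cdot x_t$, so that $\nabla_w P_t = x_t$, and keeps $w$ fixed over one sequence as equation (2.1) prescribes. The function names and data are hypothetical.

```python
def predict(w, x):
    """Linear predictor P = w . x; its gradient with respect to w is x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def supervised_total(w, xs, z, alpha):
    """Sum over the sequence of alpha * (z - P_t) * grad P_t, as in (2.2)."""
    total = [0.0] * len(w)
    for x in xs:
        error = z - predict(w, x)
        total = [tot + alpha * error * xi for tot, xi in zip(total, x)]
    return total

def td1_total(w, xs, z, alpha):
    """Sum over the sequence of the temporal difference increments (2.4)."""
    total = [0.0] * len(w)
    grad_sum = [0.0] * len(w)  # running sum of grad P_k for k = 1..t
    for t, x in enumerate(xs):
        grad_sum = [g + xi for g, xi in zip(grad_sum, x)]
        p_now = predict(w, x)
        # The successor of the last prediction is the outcome: P_{m+1} = z.
        p_next = predict(w, xs[t + 1]) if t + 1 < len(xs) else z
        total = [tot + alpha * (p_next - p_now) * g
                 for tot, g in zip(total, grad_sum)]
    return total

# Hypothetical weights, observation sequence and outcome.
w = [0.1, -0.2]
xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
z = 1.0
print(supervised_total(w, xs, z, alpha=0.1))
print(td1_total(w, xs, z, alpha=0.1))
```

The two totals agree up to rounding, but the temporal difference version only needs the running gradient sum and the two most recent predictions at each step.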


2.2.2 The TD(λ) learning algorithm

With equation (2.4), $\Delta w_t$ is updated in such a way that any difference between
$P_{t+1}$ and $P_t$ affects all of the previous predictions, $P_1, P_2, \ldots, P_t$, to the same extent.
For some problems, however, it may be preferable to provide some way to
weight the gradients, $\nabla_w P_k$, so that more recent predictions are affected the most.
To this end, we will consider an exponential weighting with recency, in which the
predictions of observation vectors occurring $k$ steps in the past are weighted
according to $\lambda^k$, where the new parameter $\lambda$ satisfies $0 \le \lambda \le 1$:

$$\Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k}\, \nabla_w P_k. \tag{2.5}$$


Equations (2.4) and (2.5) are equivalent for $\lambda = 1$. We can therefore refer to (2.4)
as a TD(1) algorithm and as a member of the more general TD(λ) family of
algorithms.

When $\lambda < 1$, TD(λ) produces weight changes that are different from the more
traditional (2.4). This is particularly true with TD(0), when $\Delta w_t$ is determined
solely by the difference between the predictions for the two most recent observations:

$$\Delta w_t = \alpha (P_{t+1} - P_t)\, \nabla_w P_t. \tag{2.6}$$
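
One way to compute the weighted sum in (2.5) incrementally is to keep a running trace $e_t = \lambda e_{t-1} + \nabla_w P_t$, which equals $\sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k$. The sketch below again assumes a linear predictor, so that $\nabla_w P_t = x_t$; the function names and data are hypothetical.

```python
def predict(w, x):
    """Linear predictor P = w . x; its gradient with respect to w is x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def td_lambda_total(w, xs, z, alpha, lam):
    """Total weight change over one sequence under rule (2.5)."""
    total = [0.0] * len(w)
    trace = [0.0] * len(w)  # e_t = sum over k <= t of lam^(t-k) * grad P_k
    for t, x in enumerate(xs):
        # Decay the old gradients by lam, then add the newest gradient.
        trace = [lam * e + xi for e, xi in zip(trace, x)]
        p_now = predict(w, x)
        # The successor of the last prediction is the outcome: P_{m+1} = z.
        p_next = predict(w, xs[t + 1]) if t + 1 < len(xs) else z
        total = [tot + alpha * (p_next - p_now) * e
                 for tot, e in zip(total, trace)]
    return total

# Hypothetical weights, observation sequence and outcome.
w = [0.1, -0.2]
xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
z = 1.0
for lam in (0.0, 0.5, 1.0):
    print(lam, td_lambda_total(w, xs, z, alpha=0.1, lam=lam))
```

With `lam=1.0` the trace is the plain gradient sum of (2.4); with `lam=0.0` it collapses to the most recent gradient, which gives (2.6).
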
2.3 Maximum-Likelihood Estimation


It is known that TD(λ), for $0 \le \lambda \le 1$, converges asymptotically to the ideal
predictions (at least for absorbing Markov processes and linearly separable
data) after an infinite amount of experience (Sutton, 1988; Dayan, 1991);
however, it is instructive to explore exactly what the TD(λ) procedures compute
after a finite amount of experience and to contrast this with the more traditional
supervised learning procedures. Following Sutton (1988), we will concentrate on
the differences between linear TD(1) (the Widrow-Hoff procedure) and linear
TD(0) on linearly separable data, since this is where differences are most clear.

After a finite number of repeated presentations, it is well-known that TD(1)
converges in such a way as to minimize the root squared error between its
predictions and the actual outcomes in the training set (Widrow and Stearns,
1985). But what does TD(0) compute after a finite number of repeated
presentations?  Suppose  that  one  knows  that  the  training  data  to  be  used  is 
generated  by  some  Markov  process.  What  might  be  the  best  predictions  on  such 
a  training  set? 

Probability  and  statistical  theory  tell  us  that  if  the  a  priori  distribution  of  possible 
Markov  processes  is  known,  the  optimal  predictions  on  a  training  set  can  be 
calculated  through  Bayes’  rule;  however,  it  is  difficult  to  justify  any  a  priori 
assumptions  about  this  distribution.  In  this  case,  mathematicians  use  what  is 
known  as  the  maximum-likelihood  estimate.  In  general,  the  maximum-likelihood 
estimate of the process that produced a set of data is that process whose
probability  of  producing  that  data  is  largest.  For  example,  assume  we  flip  a  coin 
50  times  and  see  a  head  41  of  those  times.  We  can  then  ask  for  the  best 
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estimate  of  the  probability  of  getting  a  head  on  the  next  coin  flip.  The  real  answer 
depends,  of  course,  upon  the  probability  of  having  a  fair  coin;  however,  absent 

this  a  priori  knowledge,  the  best  answer  in  a  maximum-likelihood  estimate  sense 

is  simply  82%  (or  — ). 

50 
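
The coin example can be checked numerically. The following is a minimal sketch, not part of the original experiments; the function names are illustrative. It evaluates the log-likelihood of one particular sequence containing 41 heads in 50 flips over a grid of candidate head probabilities and picks the candidate that makes the data most probable.

```python
import math

def log_likelihood(p, heads, flips):
    """Log-probability of one sequence with `heads` heads in `flips` independent flips."""
    tails = flips - heads
    return heads * math.log(p) + tails * math.log(1.0 - p)

def grid_mle(heads, flips, steps=1000):
    """Return the grid point in (0, 1) that maximizes the log-likelihood."""
    candidates = [k / steps for k in range(1, steps)]
    return max(candidates, key=lambda p: log_likelihood(p, heads, flips))

# 41 heads in 50 flips, as in the text.
p_hat = grid_mle(41, 50)
```

Because the log-likelihood has a single peak at heads/flips, the grid search lands on the grid point closest to 41/50.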

In the sort of prediction problems addressed by TD(λ), the maximum-likelihood
estimate can be defined simply. If each terminal observation has associated with
it an outcome value, then the best prediction for some state i, in the
maximum-likelihood estimate sense, is the expected value of the outcome assuming that
the observed fraction of transitions from observation state i to each of the
terminal observation states is the correct characterization of the underlying
process. In other words, if seeing a particular observation state, i, always leads
us to a particular terminal observation state, j, then the best prediction for i is
the outcome value associated with j. The TD(0) procedure moves toward this
maximum-likelihood estimate with repeated presentations of the training data
(Sutton, 1988). For other values of λ, 0 < λ < 1, it is harder to characterize exactly
what is happening; however, there is some interpolation between the
maximum-likelihood estimate and the minimal root mean squared error calculated by the
traditional supervised learning procedure (Dayan, 1991).
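
The difference between the two targets shows up on a very small training set. The sketch below uses a hypothetical data set (it is not taken from the experiments in this thesis) with one free parameter per state. One sequence passes through A and then B and ends with outcome 0, six sequences consist of B alone with outcome 1, and one consists of B alone with outcome 0. Averaging the observed outcomes per state gives the minimum squared-error fit, while the maximum-likelihood estimate is the expected outcome under the empirical transition model. All function names are illustrative.

```python
from collections import defaultdict

# Hypothetical training set: (observed states in order, final outcome z).
training_set = [(["A", "B"], 0.0)] + [(["B"], 1.0)] * 6 + [(["B"], 0.0)]

def outcome_averages(sequences):
    """Average observed outcome per state (the minimum squared-error fit)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for states, z in sequences:
        for s in states:
            totals[s] += z
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

def maximum_likelihood_estimates(sequences, sweeps=100):
    """Expected outcome per state under the empirical transition model."""
    next_counts = defaultdict(lambda: defaultdict(int))  # state -> next state -> count
    terminal_sum = defaultdict(float)                    # state -> summed outcomes on termination
    visits = defaultdict(int)
    for states, z in sequences:
        for i, s in enumerate(states):
            visits[s] += 1
            if i + 1 < len(states):
                next_counts[s][states[i + 1]] += 1
            else:
                terminal_sum[s] += z
    # Repeatedly apply the expectation equations until they settle.
    values = {s: 0.0 for s in visits}
    for _ in range(sweeps):
        for s in visits:
            expected = terminal_sum[s]
            for nxt, n in next_counts[s].items():
                expected += n * values[nxt]
            values[s] = expected / visits[s]
    return values

avg = outcome_averages(training_set)
ml = maximum_likelihood_estimates(training_set)
```

Here the two estimates agree on B (six of eight visits ended with outcome 1) but disagree on A: the outcome average for A is 0, since the only sequence containing A ended with 0, whereas the maximum-likelihood estimate for A equals that of B, because A was always followed by B.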


2.4  Prediction  versus  Control 

From Samuel’s checkers-playing program to Tesauro’s backgammon player,
temporal difference learning algorithms have been used with some success in
the domain of games. Learning in this domain, however, is not just a matter of
prediction. If a network is learning to play a multi-step game by predicting the
“goodness” of a position, it can actually exercise control over the next
position that it will see when its own output is used to pick the next “good”
position. In other words, we use the evaluation function that the network is
learning to choose its moves in the game.

We  will  avoid  becoming  bogged  down  in  the  details  of  any  particular  task  or 
game  at  this  point  by  describing  problems  of  learning  to  predict  and  control  as  the 
problem  of  finding  a  good  heuristic  function  for  searching  on  a  graph.  We  shall 
see  that  this  formalism  not  only  applies  to  domains  such  as  game  playing,  but 
also  to  domains  that  do  not  at  first  seem  to  fit  well  into  this  description,  such  as 
the  weather  prediction  problem. 


2.4.1  Searching  through  a  Graph 

Any game can be described in the following way: a directed graph made up of
nodes or states, s_1, s_2, ..., s_n, and rules for moving from one state to another (the
edges). Some of the states are initial states and some are terminating states.
If we assume that a neural net has no built-in knowledge of the rules and is only
interested in learning to predict the “goodness” of a given position, we can use
the following algorithm to teach it:


    s = s_initial
    P = net_prediction(s)
    while s is not a final state
        next_list = generate_next_states(s)
        s = best_state(next_list)
        determine ∇P and accumulate the sum of the gradients.
        accumulate the difference between the last prediction and the
            prediction of the new state.
        P = net_prediction(s)
    accumulate the difference between the final prediction and the actual
        value for the final state.

Figure 2.1: Using ordered depth-first search to choose a path in state space.
(Diagram label: intermediate states.)


In  short,  we  move  from  observation  to  observation,  using  the  neural  net’s  current 
idea  of  “good"  to  decide  among  the  next  possible  observation  states.  The 
function  generate_next_states()  embodies  the  “rules”  of  the  game  while  the 
function  best_state()  simply  uses  the  network’s  current  prediction  function  to 
provide  a  value  for  each  possible  next  state,  returning  the  “best”  one  (i.e.  the 
state  with  the  highest  or  lowest  prediction  value).  This  general  algorithm  has 
been  implemented  and  used  for  the  purposes  of  studying  the  test  cases  in  this 
thesis. 

Recasting  prediction  and  control  problems  in  this  manner  allows  us  to  view  these 
problems  as  a  variation  on  depth-first  search,  using  a  heuristic  function  to  order 
the  nodes.  The  goal  of  the  neural  net  then  is  to  learn  this  heuristic  function. 
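
The control loop described in this section can be sketched in a few lines. This is a simplified illustration rather than the implementation used for the test cases: it assumes the "best" state is the one with the highest prediction, it records only the successive prediction differences (the gradient bookkeeping is left out), and the toy task and the fixed stand-in for the network are invented for the example.

```python
def best_state(next_list, net_prediction):
    """Pick the successor that the current prediction function rates highest."""
    return max(next_list, key=net_prediction)

def run_sequence(s_initial, is_final, generate_next_states, net_prediction, outcome):
    """Follow the net's own preferences from s_initial to a final state.

    Returns the visited states and the differences P(next) - P(current),
    with the actual outcome standing in for the prediction after the
    final state.
    """
    s = s_initial
    p = net_prediction(s)
    path, differences = [s], []
    while not is_final(s):
        s = best_state(generate_next_states(s), net_prediction)
        p_next = net_prediction(s)
        differences.append(p_next - p)
        path.append(s)
        p = p_next
    differences.append(outcome(s) - p)
    return path, differences

# Toy task (purely illustrative): states are integers, a move adds 1 or 2,
# and any state >= 4 is final. The "net" is a fixed stand-in function.
path, diffs = run_sequence(
    s_initial=0,
    is_final=lambda s: s >= 4,
    generate_next_states=lambda s: [s + 1, s + 2],
    net_prediction=lambda s: s / 10.0,
    outcome=lambda s: 1.0 if s == 4 else 0.0,
)
```

The differences telescope: their sum equals the actual outcome minus the prediction for the initial state. A task without control fits the same loop when generate_next_states returns a one-element list.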

2.4.2  Non-controlling  prediction  tasks 

This approach is not restricted to prediction-control tasks. Our first example, the
daily prediction of the likelihood of rain sometime in the future, also fits into this
model. In this case, the generate_next_states() function simply returns the
observation for the next day, making the job of best_state() trivial.
Similarly, if we wish for our system to passively observe a game, as opposed to
both observing and controlling its direction, generate_next_states() can simply return
the next position. In fact, as we will see later in this thesis, the approach even
works for restricted cases of recurrent networks where the next observation state
is actually the prediction from the network itself.

Although the theoretical underpinnings for TD(λ) presented by authors such as
Sutton prove the equivalence of these methods to their supervised-learning
counterparts and provide a strong argument for their superiority in some specific
cases, there are several practical issues left unaddressed. In the next chapter,
we will use the approach above to identify and explore some of these practical
issues.


Chapter  3 


3 Practical Issues in TD(λ)


In this chapter we forgo our previously theoretical treatment to concentrate on
the more practical questions that must be addressed in order for temporal
difference procedures to be used effectively. Although some of the issues we
discuss here have been explored in some detail by both Tesauro (1991) and
Sutton (1984, 1988), they have not been completely addressed. In some cases,
their explorations have raised even more questions that have remained largely
unanswered.

We  categorize  these  issues  into  two  broad  groups:  algorithmic  and  task- 
dependent.  We  begin  with  algorithmic  considerations  but  concentrate  mostly  on 
the  task-dependent  issues  as  these  issues  better  define  the  kinds  of  problems 
that are likely to be encountered in the real world.

3.1 Algorithmic Considerations


3.1.1  Credit  Assignment 

In multi-step problems, the states in a sequence, s_1, s_2, ..., s_m, are dependent. That
is, each state, s_t, bears some relationship to the state, s_{t-1}, that preceded it. In a
game like chess, for example, a particular position constrains the positions that
follow it. If no pawns are present on the board in some position, for example, no
positions that follow can have pawns present. Because of this interdependence,
it may be difficult to determine which states have had the most effect on the
outcome. Nevertheless, after all the states have been seen by a net and an
outcome signal has been presented, the training algorithm must apportion credit
to each state, determining in some way which states are most responsible for the
final outcome. In our chess example, we might want to know which move was
our worst one and actually contributed most to a loss. This is known as the
temporal credit assignment problem.

There is a similar structural credit assignment problem. Each weight parameter
in w contributes in some way to the network’s prediction and there must be some
way to determine which parameters are most responsible for a correct or
incorrect prediction. There are various schemes for determining this fairly well,
the most commonly used being gradient descent.

By contrast, temporal credit is often impossible to determine. TD(λ) uses the λ
parameter in Equation (2.4) to address this directly. As an exponential weighting
scheme, it assumes that the later states contribute the most to the final outcome
or, conversely, that the final outcome should affect the prediction function’s view
of the last states the most. The effect on earlier states will “bubble” back after
continued iterations. In a prediction-control task, this corresponds approximately
to a depth-first search: the values of the states that are furthest from the root are
changed first. In other words, the same path in the search space is followed with
only the last state in the sequence changing. When all the last states have been
tried, the next-to-last state is changed and all the paths from that state are
explored. This continues, “bubbling” up the search tree until an optimal path, or
heuristic function, is found. Of course, this is only approximate. The states may
be related in arbitrary ways and may seem similar enough to the untrained
network that a change to the predicted value of the last state in a sequence
produces a similar change to the predicted value of one of the earlier states.
This analogy is more suited to the searching pattern of the network late in the
learning process.
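
The exponential weighting can be made concrete for a linear predictor. The sketch below assumes the update has the form given by Sutton (1988), Δw_t = α (P_{t+1} − P_t) Σ_{k=1..t} λ^(t−k) ∇_w P_k, with P_t = w · x_t so that ∇_w P_k = x_k. It also assumes that the weights are held fixed during a sequence and that the summed change is applied afterwards. The function and variable names are illustrative.

```python
def td_lambda_updates(observations, outcome, weights, alpha, lam):
    """Accumulate the TD(lambda) weight change for one sequence (linear predictor)."""
    def predict(x):
        return sum(w_i * x_i for w_i, x_i in zip(weights, x))

    trace = [0.0] * len(weights)      # running sum of lam**(t-k) * grad P_k
    delta_w = [0.0] * len(weights)
    for t, x in enumerate(observations):
        # For a linear predictor the gradient of P_t with respect to w is x_t.
        trace = [lam * e_i + x_i for e_i, x_i in zip(trace, x)]
        is_last = t + 1 == len(observations)
        # The actual outcome replaces the next prediction at the end of the sequence.
        target = outcome if is_last else predict(observations[t + 1])
        error = target - predict(x)
        delta_w = [d_i + alpha * error * e_i for d_i, e_i in zip(delta_w, trace)]
    return delta_w

# Three states with one-hot observation vectors, all weights initially zero.
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half = td_lambda_updates(one_hot, outcome=1.0, weights=[0.0] * 3, alpha=1.0, lam=0.5)
zero = td_lambda_updates(one_hot, outcome=1.0, weights=[0.0] * 3, alpha=1.0, lam=0.0)
one = td_lambda_updates(one_hot, outcome=1.0, weights=[0.0] * 3, alpha=1.0, lam=1.0)
```

With λ = 0.5 the final error changes the weight for the last state by 1, the one before it by 0.5 and the first by 0.25. With λ = 0 only the last state is affected, and with λ = 1 all three are changed equally.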

In theory, this depth-first search could be changed to correspond more closely to
some sort of breadth-first search. In this way, states furthest from the final states
are changed first. One possible way to accomplish this is by inverting λ in
Equation (2.5) and further restricting its range:

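$$ \Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \left( \frac{1}{\lambda} \right)^{t-k} \nabla_w P_k. \qquad (3.1) $$

Another possibility would involve reversing the order of the gradients:

$$ \Delta w_t = \alpha (P_{t+1} - P_t) \sum \ldots \qquad (3.2) $$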


It appears that this breadth-first search possibility has not been fully explored.
For some non-Markovian problems, including the kinds of recurrent net learning
tasks presented in chapter five, this kind of search might be more appropriate
than the depth-first approach.


3.1.2 Tweaking α and λ

As with most training algorithms, TD(λ) has a number of its own parameters that
can be changed from task to task, most notably λ and α. In practice, it may be
best that α changes over time. Initially, large values might help bootstrap the
network, but as the net approaches a more stable function, this parameter can be
reduced to allow the net to fine-tune itself. Intuitively, the research in this area
from the supervised learning community would seem to apply.

On the other hand, it is more difficult to characterize the way in which we should
choose λ. Watkins (1989) has pointed out that in choosing the value of λ, there
is a trade-off between the bias caused by the error in P at the current stage of
learning and the variance of the real terminal values, z. The higher the value we
choose for λ, the more significant are the values of P_t for higher values of t, and
the more effect the unbiased terminal values will have, leading to higher variance
and lower bias. On the other hand, P_t for larger t will have less significance and
the unbiased terminal values will have less effect if we lower λ. This leads to
smaller variance and greater bias.

3.1.3 Convergence of TD(λ)

Sutton (1988) and Dayan (1991) have proved that TD(λ) converges in the case
of a linear network trained with linearly independent data sets. Unfortunately,
linear networks are of limited use. More practical applications require the use of
non-linearities. In this case, the TD(λ) algorithm may not converge to a locally,
let alone globally, optimal solution.




3.1.4 Completion of TD(λ)

Even in cases where TD(λ) will converge, it is unclear how long this
convergence might take. The state space upon which the algorithm searches
may  be  infinite  or  incredibly  large  (as  is  the  case  with  chess).  Further,  in 
prediction  and  control  tasks,  the  paths  explored  by  the  algorithm  may  be  circular. 
For  example,  when  controlling  a  simulated  car,  the  network  may  end  up  at  some 
state  where  it  has  been  before  within  the  current  sequence  of  states.  Since  the 
prediction  function  is  not  updated  until  an  entire  sequence  is  generated  and 
concluded,  this  will  lead  to  infinite  repetition.  In  cases  where  this  is  possible, 
some  outside  agent  must  be  employed  to  terminate  an  infinitely  repeating 
sequence. 
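
One simple form of such an outside agent is a wrapper that watches for repeated states. The sketch below is an illustration under stated assumptions rather than the mechanism used in this thesis: states must be hashable, choose_next stands in for the network-driven choice of the next state, and the step limit is an arbitrary safeguard for very long sequences.

```python
def follow_until_repeat(s_initial, is_final, choose_next, max_steps=1000):
    """Follow choose_next from s_initial, stopping at a final state, at the
    first revisited state, or after max_steps transitions."""
    s = s_initial
    path = [s]
    seen = {s}
    for _ in range(max_steps):
        if is_final(s):
            return path, "final"
        s = choose_next(s)
        if s in seen:
            # The prediction function is fixed within a sequence, so a
            # revisited state would be followed by the same choices again.
            return path + [s], "cycle"
        seen.add(s)
        path.append(s)
    return path, ("final" if is_final(s) else "step limit")

# A three-state loop with no final state is cut off at the first repeat.
cyclic_path, cyclic_reason = follow_until_repeat(
    0, is_final=lambda s: False, choose_next=lambda s: (s + 1) % 3)
# The same transitions with state 2 marked final end normally.
final_path, final_reason = follow_until_repeat(
    0, is_final=lambda s: s == 2, choose_next=lambda s: (s + 1) % 3)
```

What to do with a sequence that was cut off (discard it, or assign it some outcome value) is a separate design decision that this sketch does not address.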

3.1.5  Sequence  Length  and  the  Curse  of  Dimensionality 

The  fact  that  learning  time  on  networks  increases  exponentially  as  the  dimension 
of  the  input  increases  is  known  as  the  curse  of  dimensionality.  In  the  case  of  the 
multi-step  problems  that  temporal  difference  algorithms  attack,  this  problem  may 
extend as well to the length of the observation or state sequences. It is still
unclear exactly how well the TD(λ) algorithm scales with the length of state
sequences. There is no reason to believe that the algorithm’s performance will
not degrade exponentially.


3.2  Task  Dependent  Considerations 

Many of the practical issues that we have described above are best understood
in the context of specific kinds of tasks. It is the type of task presented to the
neural network that drives many of our algorithmic decisions. Good values of λ
and α will differ radically from problem to problem, for example. In order to better
understand the algorithmic considerations, we present a few types of tasks and
discuss their effects upon the particulars of the TD(λ) algorithm.

3.2.1  Prediction  and  Partial  Control 

Throughout this chapter and in chapter two, we noted that many problems of
prediction are actually problems of both prediction and control. It is worth noting
that in some problem domains, such as game theory, we are interested in
problems of prediction and only partial control. In a game like chess, for example,
a network might learn by playing opponents, some of which are considerably
more skilled, as opposed to just playing itself.

In this case, the network has only partial control of its environment. It makes a
prediction which is used to choose the next state, but then another agent,
perhaps an adversarial one, chooses the state that follows. What form should
the sequence that it sees take?

One  possibility  is  that  it  should  only  see  the  states  for  which  it  has  predicted 
values.  This  would  not  allow  it  to  take  advantage  of  the  information  of  the  states 
chosen  by  the  other  agents.  On  the  other  hand,  the  network  might  also  be 
presented  these  states  along  with  the  other  agent’s  predictions. 

In the first case, the sequence seen by the network is roughly halved, meaning
that the importance of each state in changing the parameters of the network is
roughly squared. How would this affect the learning rate and the ability to
converge? Can this be overcome by simply using a value for λ that is the square
root of the value we would normally use? More importantly, does skipping every
other state affect the validity of the assumption that there is some temporal
continuity?

In  the  second  case,  it  is  unclear  how  the  interaction  of  these  other  prediction 
values  with  the  net’s  predictions  will  affect  learning.  They  could  simply  act  as  a 
straightforward supervised learning signal, as if the network were presented with
input-output  pairs,  or  they  might  have  much  more  complex  effects. 

In either case, the impact would seem to depend greatly upon both the absolute
accuracy of the external agent and its accuracy relative to the network’s. In the
case of a game, the relative accuracy takes three forms: a vastly inferior
opponent, a vastly superior opponent and an opponent of about the same skill.
With a vastly inferior opponent and a constant stream of positive feedback, the
net might find a solution that only learns to win against poor play, while with a
vastly superior opponent and continual negative feedback, the net might simply
learn to lose, predicting all states to be equally bad.

Further, a huge disparity in ability could lead to an unstable search. Since
successive states would probably differ widely from one to another, the series of
predictions would also. This would likely prolong learning and increase the
possibility of poor behavior. With an opponent of about the same skill, it seems
more likely that the network would slowly improve its performance; however, this
requires that the opponent’s skill level changes to keep pace with the network’s
ability. Otherwise, the network would end up in one of the situations described
above. For the purposes of our examples, this is achieved through self-play.
With the net playing both sides, the probability of each side remaining at equal
skill levels is increased.

An interesting question involves the ability of TD(λ) to generalize well in a task
involving only partial control. For example, if the network only predicts values for,
say, odd-numbered states while an adversary predicts values for even-numbered
states, the network may learn a prediction function that only applies to the
odd-numbered states to which it has been exposed. It is possible, for instance, that the
odd-numbered states have their own substructure and a different function
optimizes the predictions for the even-numbered states. If the net then “switches
sides” with the adversary and attempts to predict even-numbered states, it may
perform poorly.

3.2.2  Prediction  and  Control  Revisited 

As  touched  on  above,  even  the  case  of  complete  control  by  the  network  is 
different  from  the  problem  of  pure  prediction.  By  controlling  its  own  actions,  the 
net  runs  the  risk  of  finding  a  self-consistent  but  sub-optimal  predictor-controller. 
This  problem  has  not  been  addressed  theoretically  and  may  be  beyond  the 
scope of TD(λ).

3.2.3  Relative  and  Absolute  Accuracy 

One potential problem with the TD(λ) learning rule is that it is designed to teach
a network to accurately predict a final outcome, z. Many times in
prediction-control problems, however, we are really more interested in the network’s ability
to choose among several alternatives. For this, it does not need to provide an
accurate estimate of the actual “goodness” of a state so much as provide an
estimate that accurately orders the states; however, the method by which
feedback is provided to the network is not always conducive to this goal. In fact,
this method may be counter-productive because small errors in absolute
accuracy can lead to very large errors in relative accuracy.

Figure 3.1: Network 1 predicts values that are very close to the actual values while network
2 has a sum of squared errors that is more than twelve times as large; however, because
network 2 has done a better job of ordering the states, it chooses the best state.
(Columns in the figure: Net 1, Net 2, actual values.)

Similarly, since the error signal is measuring the absolute accuracy of the
network’s prediction instead of its relative accuracy, it is difficult to analyze and
determine how well the algorithm is really doing. For an example of these
problems, see figure 3.1.
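
The gap between absolute and relative accuracy is easy to reproduce with made-up numbers. The values below are invented for illustration and are not the values plotted in figure 3.1; the sketch assumes that a higher value means a better state.

```python
def sum_squared_error(predicted, actual):
    """Absolute accuracy: summed squared difference from the actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

def best_index(values):
    """Relative accuracy in its simplest form: which state is ranked first."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical values for three candidate states.
actual = [0.40, 0.50, 0.45]
net_1 = [0.46, 0.44, 0.45]   # close to the actual values, but ranks state 0 first
net_2 = [0.10, 0.30, 0.20]   # far from the actual values, but ordered correctly

error_1 = sum_squared_error(net_1, actual)
error_2 = sum_squared_error(net_2, actual)
```

In this example the second set of predictions has by far the larger squared error, yet it is the one that selects the state with the highest actual value.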

3.2.4  Random  Tasks  and  Noise 

It is worth noting that Sutton (1988) and others have performed analysis with
noise-free, deterministic tasks. Although Tesauro (1991) has explored teaching
a net to play backgammon (a game which involves randomness or noise in the
form of dice), it is not yet known how the introduction of other kinds of noise
affects TD(λ).

3.2.5 Representation

As  with  all  neural  network  training  algorithms,  temporal  difference  procedures  are 
sensitive  to  the  representations  chosen  for  both  the  data  and  the  output. 
Representations  can  be  designed  so  that  they  explicitly  contain  a  wealth  of 
relevant  information  or  can  be  designed  plainly,  so  that  a  neural  net  attempting  to 
generalize must somehow learn to represent a great deal of some underlying
structure.  Without  enough  information  in  the  representation  it  may  be  extremely 
difficult  to  generalize. 

A representation issue that is of particular importance to TD(λ) is the linear
dependence of the observation vectors. Dayan (1991) shows that if the
observation vectors presented to the network are not linearly independent, then
TD(λ) for λ ≠ 1 converges to a solution that is different from that of the least mean
squares algorithm, at least for linear networks. In this case, using the inaccurate
estimates from the next state, P(x_{t+1}), to provide an error signal for the estimate
of the current state, P(x_t), may not be harmless. With linearly dependent
observation vectors, these successive estimates become biased on account of
what Dayan has deemed their “shared” representation. The amount of the extra
bias between the estimates is related to the amount of their sharing and the
frequency with which the transitions occur from one state to the next. So, while
TD(λ) for λ ≠ 1 will still converge, it will be away from the “best” value to a
degree determined by the matrix:

$$ \left( I - (1 - \lambda)\, Q\, [I - \lambda Q]^{-1} \right), $$

where Q is the square matrix of transition probabilities. It remains unclear
exactly how this affects the usefulness of TD(λ) for typical problems.

3.2.6 Lookup Tables

If  there  are  enough  parameters  available  for  a  task,  a  network  can  act  as  a 
lookup  table  by  explicitly  storing  the  values  of  the  training  data.  In  the  case  of 
RBF  networks,  this  requires  one  center  or  RBF  node  for  every  member  of  the 
training  data. 

Sutton’s  proof  for  convergence  relies  on  a  lookup-table  approach  and  therefore 
requires  that  every  possible  state  be  visited  an  infinite  number  of  times.  This  is 
impractical  with  real  world  problems. 

3.2.7  Maximum-Likelihood  Estimates  and  Non-Markovian  Tasks 

As seen in section 2.3, Sutton’s convergence and optimality proofs rely on the
assumption that the tasks are absorbing Markov processes. For these kinds of
Markovian processes, TD(λ) computes a maximum-likelihood estimate, arguably
a desirable feature. On the other hand, it is unclear how useful these procedures
can be with non-Markovian processes. For example, we may have a task where
the observation state c in the sequence (..., a, c, ...) should be viewed differently
than when the same observation c is preceded by a different state, (..., b, c, ...).
If there is no straightforward and computationally tractable way of encoding this in
the data or some way for TD(λ) to discover it (and the latter is probably not the
case), then TD(λ) may produce very inaccurate predictions and have no way of
correcting them.

Chapter 4


4  Example:  Tic-Tac-Toe 


In  this  chapter  and  the  next,  we  explore  several  case  studies  that  ground  some  of 
the  issues  that  have  been  raised  in  the  previous  chapters  in  particular  contexts. 
Each  experiment  in  this  chapter  used  the  approach  that  was  outlined  in  section 
2.4.1. 

4.1  Tic-tac-toe 

Our  first  case  study  involves  the  game  tic-tac-toe.  Tic-tac-toe  is  a  two-player 
game  played  on  a  three-by-three  grid.  Each  player  is  represented  by  a  token, 
usually  X  and  O.  For  our  purposes,  we  will  assume  that  X  always  goes  first. 
Players  take  turns  placing  a  token  in  an  empty  spot  on  the  grid.  A  player  has 
won  the  game  when  she  has  placed  three  of  her  tokens  in  a  row,  either  vertically, 
horizontally or diagonally. The game is a draw if all nine spots are filled and
neither player has placed three of her tokens in a row. Some game positions are
shown in figures 4.1 and 4.2.

Figure 4.1: a) shows a win for X, b) shows a “fork” for X, meaning that
X has two ways to win on the next move, and c) shows a draw.


Tic-tac-toe is a completely deterministic game with a finite number of states. In
fact, the size of the state space can be reduced to only a few hundred by taking
advantage of the game’s symmetrical nature. Best play by both sides will always
result in a draw. The following simple set of rules describes best play in any
position:


If you can place a token immediately and win, do so.
If your opponent can win on her next move, block her.
If you can “fork” (i.e. place a token such that you have two ways to win on
    your next move), do so.
If you can place a token on a square so as to force a block on the next
    move by your opponent and, by making that block, she will not
    “fork” you, place your token on that square. Prefer corner squares
    to non-corner squares.
If your opponent can fork, block her.
If the center position is free, take it.
If a corner is free, take it.
If none of the other rules apply, place your token randomly.
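
Several of these rules are mechanical enough to sketch directly. The following is a partial illustration only: it covers the win, block, center, corner and random rules and deliberately omits the two fork rules and the forced-block rule, so it does not play perfectly. The board layout (a list of nine entries holding "X", "O" or None, read left to right and top to bottom) and all names are assumptions made for the example.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def winning_squares(board, player):
    """Empty squares that would complete three in a row for `player`."""
    squares = []
    for line in LINES:
        marks = [board[i] for i in line]
        if marks.count(player) == 2 and marks.count(None) == 1:
            squares.append(line[marks.index(None)])
    return squares

def choose_move(board, player, rng=random):
    """Apply a subset of the rules in priority order (fork rules omitted)."""
    opponent = "O" if player == "X" else "X"
    empty = [i for i in range(9) if board[i] is None]
    corners = [i for i in (0, 2, 6, 8) if board[i] is None]
    for candidates in (winning_squares(board, player),     # win immediately
                       winning_squares(board, opponent),   # block a win
                       [4] if board[4] is None else [],    # take the center
                       corners,                            # take a corner
                       empty):                             # anything left
        if candidates:
            # Several squares may satisfy a rule; pick one at random.
            return rng.choice(candidates)
    raise ValueError("board is full")

board = ["X", "X", None,
         "O", "O", None,
         None, None, None]
x_move = choose_move(board, "X")   # X can win at square 2
o_move = choose_move(board, "O")   # O can win at square 5
```

Extending the sketch to the fork rules requires looking one move further ahead, counting the winning squares each candidate move would create.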


For each of these rules, it is possible that more than one state will satisfy the
condition. In this case, a player can simply pick one randomly.

Figure 4.2: Each of these positions is equivalent as they can all be rotated
or flipped into the same prototypical position. The rightmost position, for
example, can be rotated clockwise 90° and become the leftmost position.

4.2 Practical Issues in Learning Tic-tac-toe

As a two-player adversarial game, tic-tac-toe provides an opportunity to explore
many of the issues that have been described in earlier chapters. In particular, it
allows us not only to test the ability of TD(λ) to solve prediction-control problems
but to compare its ability to solve prediction tasks that allow complete control with
prediction tasks allowing only partial control. To this end, we shall see how well a
network can learn by self-play as well as by playing against an opponent who, in
our experiments, always follows an optimal strategy. We will also explore the
ability of these latter nets to generalize, by testing their ability to “switch sides”
and provide accurate predictions for states that they have not seen.

Because tic-tac-toe has a relatively small number of states, the network should be
able to visit almost every state in the course of learning. Further, since this game
has a relatively small number of states in any sequence (at most nine with self-play
and five against an opponent), we should not have to concern ourselves with this
particular version of the curse of dimensionality.

Although tic-tac-toe is a relatively simple game to learn to play well, there are
some key positions that are very bad. For example, on the first two moves, a
player facing an expert opponent is guaranteed a loss by placing an O in a
non-corner square in response to an X being placed in the center square (see figure
4.3). Because the “good” and “bad” moves are somewhat clear-cut and always
deterministic, it is easier to determine the quality of the ordering capability of an
evaluation function, independent of its absolute “error.”

Figure 4.3: If X places her token in the center square on the first move, O cannot
safely place hers in a non-corner square. X is able to force a “fork” no matter which
non-corner square O chooses because of the symmetrical nature of the game.

Finally,  it  is  worth  noting  that  tic-tac-toe  positions  contain  a  great  deal  of 
structure.  Thousands  of  positions  are  collapsible  into  several  hundred  by 
symmetry  alone.  A  network  learning  the  most  compact  function  for  this  problem 
would  find  some  way  to  represent  this  information. 
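
The collapse by symmetry can be examined directly. The sketch below is illustrative and not part of the original experiments: it assumes boards are tuples of nine integers in row-major order with -1 for X, 1 for O and 0 for an empty square, maps each board to a canonical representative of its eight rotations and reflections, and enumerates the positions reachable when X moves first and play stops at a win or a full board.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def rotate(board):
    """Rotate a 3x3 board (tuple of 9, row-major) 90 degrees clockwise."""
    return tuple(board[3 * (2 - c) + r] for r in range(3) for c in range(3))

def mirror(board):
    """Flip the board left to right."""
    return tuple(board[3 * r + (2 - c)] for r in range(3) for c in range(3))

def symmetries(board):
    """All eight rotations and reflections of a board."""
    result = []
    for start in (tuple(board), mirror(board)):
        b = start
        for _ in range(4):
            result.append(b)
            b = rotate(b)
    return result

def canonical(board):
    """A single representative shared by all symmetric variants."""
    return min(symmetries(board))

def winner(board):
    """Return -1 or 1 if that player has three in a row, otherwise 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def reachable(board=(0,) * 9, player=-1, seen=None):
    """Collect every position reachable from `board` with `player` to move."""
    seen = set() if seen is None else seen
    if board in seen:
        return seen
    seen.add(board)
    if winner(board) == 0:
        for i in range(9):
            if board[i] == 0:
                reachable(board[:i] + (player,) + board[i + 1:], -player, seen)
    return seen

positions = reachable()
classes = {canonical(p) for p in positions}
```

Comparing len(positions) with len(classes) gives the reduction from the thousands of raw positions to the several hundred that remain once symmetric duplicates are merged.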


4.3  Experiments  with  Tic-tac-toe 

Four experiments were conducted. One experiment involved a network learning
to play tic-tac-toe through self-play while the other networks learned by playing
against an opponent employing the strategy described in section 4.1, randomly
choosing among all the available best moves in any given position. Against this
“perfect” opponent, one network always played X, another always played O and
the last alternated between X and O.

Each  experiment  used  a  GRBF  network  with  200  centers.  Each  of  the 
experiments  began  with  the  same  initial  parameter  values.  The  first  425 
iterations  were  treated  as  a  bootstrapping  phase.  In  this  phase  all  but  one  of  the 
initial  board  positions  had  eight  of  their  slots  filled.  The  exception  was  the  blank 
board.  The  next  575  iterations  added  all  the  board  positions  with  seven  slots 
filled.  For  the  final  1000  iterations,  we  chose  one  fifth  of  all  the  possible  board 
configurations  for  starting  positions. 

For all of these experiments, the value of λ was set to 0.6. The value of α was
decreased after each phase, on the assumption that accuracy would suffer for
large values of α during the last phases of learning. This assumption was verified
by some initial results.

Each  board  position  was  represented  by  a  ten-dimensional  vector.  The  first  nine 
components  represented  the  tokens  placed  in  each  of  the  nine  squares  on  the 
tic-tac-toe  board,  numbered  from  left  to  right  and  top  to  bottom  while  the  last 
component determined which player’s turn it was to move. X was always
represented by -1, O by 1, and an empty square by 0.
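
The representation above can be sketched as a small encoding routine. The text does not say how the tenth component encodes the player to move; using the same -1/1 values as the tokens is an assumption of this sketch, as are the function name encode and the string-based board format.

```python
X, O, EMPTY = -1, 1, 0


def encode(squares, to_move):
    """Build the ten-dimensional input vector for a position.

    squares: nine characters ('X', 'O' or ' '), read left to right and
             top to bottom.
    to_move: 'X' or 'O'; encoded with the token values (an assumption).
    """
    token = {'X': X, 'O': O, ' ': EMPTY}
    if len(squares) != 9:
        raise ValueError("a board has exactly nine squares")
    return [token[s] for s in squares] + [token[to_move]]
```

For example, a board with X in the top left-hand corner and O in the center, with X to move, is encoded as [-1, 0, 0, 0, 1, 0, 0, 0, 0, -1].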

The output was a vector in three dimensions, representing the probability of X
winning, the probability of O winning and the probability of a draw, respectively.
The best_state() function computed a scalar value from these three probabilities
to determine the ordering of possible states:

$$E = \Pr(i) - \Pr(j) + \frac{\Pr(\text{draw})}{2} \qquad (4.1)$$

where i represents the player with the current turn and j her opponent. In the
case of a certain win for i, E = 1; when i is certain to lose, E = -1; and when a
draw is certain, E = 1/2. This tends to make drawing much more like winning than
losing. In the case of tic-tac-toe, where best play by both sides always leads to a
draw, this seems like a desirable trait.
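
A minimal sketch of this ordering step follows. The helper evaluate implements equation (4.1); the signature of best_state and the predict callable (standing in for the network) are hypothetical and chosen only for illustration.

```python
def evaluate(p_x, p_o, p_draw, player):
    """Scalar score of equation (4.1) from the point of view of `player`."""
    p_self, p_other = (p_x, p_o) if player == 'X' else (p_o, p_x)
    return p_self - p_other + p_draw / 2


def best_state(candidates, player, predict):
    """Return the candidate state with the highest score for `player`.

    predict: hypothetical callable mapping a state to the triple
             (Pr(X wins), Pr(O wins), Pr(draw)).
    """
    return max(candidates, key=lambda s: evaluate(*predict(s), player))
```

With predictions (0.2, 0.5, 0.3) for state a and (0.1, 0.1, 0.8) for state b, X scores a at -0.15 and b at 0.4 and so prefers b, while O scores a at 0.45 and prefers a.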


4.4 Tic-tac-toe Results

There  are  several  general  results:  the  neural  network  trained  through  self-play 
performs  the  best  against  the  optimal  opponent;  each  network  learned  to 
accurately  predict  most  drawing  positions;  all  networks  perform  best  when 
playing  X;  none  of  the  networks  learned  the  underlying  symmetrical  structure  of 
the  tic-tac-toe  positions;  and  while  the  networks  may  play  well  against  a  rational 
opponent,  an  irrational  opponent  can  defeat  them. 

It is not too surprising that the networks learned to play better when playing X.
The design of the first bootstrap phase only allowed for one move, and that move
was always by player X. Therefore, there was a great deal of experience with the
positions for the first player, even for the network that played O.

The design of the bootstrap phase also provided early experience for positions
leading to draws since many of those bootstrap positions necessarily lead to a
draw. The next phase, which only allowed for a maximum of two moves, often
forced draws as well. Given this initial bias and the fact that most lines of play in
tic-tac-toe lead to draws, it is also unsurprising that all the nets learned to predict
draws fairly well.

The choice of input representation was a deliberate one, designed to provide the
minimum of information. As such it may not be surprising that the networks did
not learn some way to transform the data and recognize symmetries. It is
possible that allowing a network to change its W matrix (thus moving from a
GRBF to a HyperBF network) might facilitate this somewhat.

Even when a network played well against a good opponent, it would fail against a
bad opponent. When playing against these TD(λ) networks, an opponent had
only to ignore the network's pending win. Instead of blocking that win, the
opponent would be better served by setting up her own win. In some positions,
the network would rather block her threat than actually win, while in others it
would simply make some other move. One way to explain this is to assert that
the networks learned to play defensively, preferring “not losing” to “winning.”

It may seem that some of this is due to the form of equation (4.1); however, a
choice function that makes the network a more “aggressive” player by placing
the value of drawing exactly between the values for winning and losing:

$$E = 2\Pr(i) - \Pr(j) + \frac{\Pr(\text{draw})}{2} \qquad (4.2)$$

seemed to have no effect on the lines of play chosen by any of the networks, at
least after many training epochs.

Figure 4.4: The percentage of times each net drew against the optimal player. Since the
networks were deterministic there are only a small number of possible games each net can
play against the optimal strategy. For example, the self-playing net as O always responded
to an X in the center of the board by placing an O in the bottom left-hand corner. From here,
given the network's level of play and the strategy used by our optimal player, there were only
three possible lines of play that could follow.

4.4.1  Self-playing  networks 

As  a  matter  of  strategy,  the  self-playing  network  learned  to  place  its  token  in  the 
center  position  when  playing  X.  The  four  possible  next  “best”  moves  are  all 
symmetric  (see  figure  4.2);  however,  the  network  only  learned  to  draw  against 
three  of  them.  In  fact,  the  network  learned  non-symmetrical  strategies  for  each  of 
the  positions  it  learned  to  draw  against. 

As O, the network played its token in the bottom left-hand corner, one of the four
best moves against an X in the center; however, it could only manage to draw
one third of the time. It is worth noting that one of the losses is due to its inability
to prevent a fork. It seems that it has not learned to prefer corner squares to
non-corner squares when forcing a block.

Figure 4.5: Results of each of the networks playing against each other and themselves. Although
the self-playing net seems to be the best player against an optimal opponent, it manages to lose
against the sub-optimal networks.

In  short,  it  appears  that  the  network  has  learned  the  importance  of  the  center  and 
the corner squares fairly well in the most common lines of play, mastering some
of  the  opening.  While  it  seems  to  have  learned  much  less  about  middle  game 
strategies  such  as  forking,  its  choices  for  second  moves  tend  to  lead  to  a  series 
of  forced  moves.  When  it  follows  the  forced  moves,  it  tends  to  draw. 

The network taught by self-play learned to play against the optimal player better
than any of the other networks. It is possible that the network performs best
because it has had the most “stable” opponent and the most stable set of inputs.
Each state was similar to the ones surrounding it. On the other hand, the
networks facing an opponent saw vastly different states from time point to time
point and almost always had large temporal errors. There is evidence that
suggests that it is difficult to learn well under these circumstances (Poggio,
personal correspondence).

Figure 4.6: The percentage of times the self-playing network drew against the optimal player as a
function of the number of epochs the network has experienced. Although the percentage of games
drawn seemed to level off, the prediction function used by the network continued to evolve.


As  noted  before,  these  networks  often  had  difficulty  playing  against  an  opponent 
who  was  irrational.  In  fact,  even  though  the  self-playing  network  played  best 
against  the  optimal  opponent  (see  figure  4.4),  it  would  sometimes  lose  against 
the  networks  which  played  "worse"  (see  figure  4.5). 


It is worth noting that the self-playing network seemed to be the only one to
choose moves from among the most positive possibilities instead of from the
least negative possibilities. For example, when choosing among the best first
moves for X, this type of network was the only one whose prediction function
generated positive values for equation (4.1). Each of the others generated only
negative values. This is due in part to the self-playing network having learned to
predict values near zero for the probabilities of either X or O winning and values
nearer to one for the probability of a draw. It is also due to this network coming
closest to representing its output as probabilities. The other networks produced
outputs with components as high as 2.9 and values for equation (4.1) as low
as -3.1.


4.4.2  Opponent-playing  networks 

None of the networks trained against the perfect opponent “learned to lose” in the
sense that they never drew or learned a function that drove all states to “bad”
values. Analysis suggests that the depth-first analogy fits well in this case. A
game sequence was repeated several times with the last states in the sequence
given lower and lower scores. Eventually, the last state received a score so low
that the sequence changed at that point. If none of these changes led to good
play, the states earlier in the sequence had accumulated enough changes that
the sequence changed at those points instead. Eventually, the network could not
help but stumble upon a good (or rather a non-disastrous) line of play.

Even  though  each  of  these  networks  learned  different  prediction  functions,  each 
network learned to order states in more or less the same way. Thus, each net
played the same games against each other and the optimal opponent. This is
somewhat  counter-intuitive  since  the  X-playing  network  and  the  O-playing 
network  were  exposed  to  different  lines  of  play. 

Like the self-playing network, each of these nets learned to play X in the center
as a first move. Their prediction functions and the optimal strategy allowed for
six possible lines of play. The opponent-playing networks could only draw in one
of these lines. Of the five losing lines of play, the networks lost to a fork only
once. In general, the networks failed to block immediate wins. This was as true
for the X-playing network as it was for the other networks.

This same problem surfaced with these networks when they played O. Even
though the networks played a corner square as a first move against an X in the
center, they simply never blocked immediate wins. Therefore, each network lost
every game playing O, including the O-playing network.

Many of these problems seem to be traceable to a simple phenomenon. The
probability predictions for the final positions are much more accurate than the
early predictions. It appears that the sequence length of the states for these
networks was short enough that the negative feedback for the last states had a
strong effect on the middle states. Useful signals about the final states in a
sequence, which should have led to an evaluation function that chose appropriate
final moves, were confounded by being associated too strongly with middle
states. Instead of minute changes in the evaluation function and incremental
improvement, the network experienced large changes and sometimes played
vastly different games from training epoch to training epoch. As noted before,
learning is often difficult in these circumstances. This is especially noticeable
with the network that alternated play between X and O. This network saw very
different lines of play over a short time and developed correspondingly different
evaluation functions, sometimes oscillating between evaluations.

In  theory,  these  opponent-playing  networks  could  eventually  play  as  well  as  the 
self-playing  network  against  expert  play.  The  depth-first  search  nature  of  the 
training  still  keeps  the  network  away  from  learning  to  lose.  Unfortunately,  even  if 
the  networks  can  improve  their  play  against  the  optimal  opponent,  it  appears  that 
they  will  take  much  longer  than  the  self-playing  network  and  follow  too  many 
fruitless  paths  in  the  interim. 




For  at  least  the  first  several  series  of  training  phases,  self-play  seems  superior. 
Training  a  network  by  having  it  play  against  an  optimal  or  near-optimal  opponent 
might  not  be  useful  until  a  self-trained  net  has  already  learned  a  fairly  good 
evaluation  function  on  its  own.  At  this  point,  with  a  sufficiently  small  learning  rate, 
occasional  negative  feedback  might  help  to  jostle  it  from  a  self-consistent  but 
suboptimal evaluation function.


Chapter 5


5  Example:  Recurrent  Networks 

In this chapter, we use the problem of simple segmentation to explore the use of
TD(λ) with neural nets that are not structured as feedforward networks (namely,
a restricted subclass of so-called recurrent networks) in order to learn complex
functions that can be described as the iteration of simpler functions. Each
experiment in this chapter uses the approach outlined in section 2.4.1.

5.1  Function  Iteration 

It is sometimes possible to describe a complex function as the iteration of a
simpler function:

$$F(x) = \lim_{n \to \infty} f^{n}(x). \qquad (5.1)$$

Phrased in the language of a neural network, the outputs of the network are fed
back into the inputs. Instead of each unit in a network calculating a value once, a
unit updates its value each time its input changes. Networks that use this
principle are called recurrent networks (see figure 5.1).

Figure 5.1: A HyperBF network with recurrent connections.

These  networks  are  equivalent  to  feedforward  networks  in  that  they  can 
approximate  any  function  arbitrarily  well.  Beyond  this,  they  have  been  used  for 
various  tasks,  including  dimensionality  reduction,  with  some  success  (Jones, 
1992). 

This formulation of a function can be useful for other reasons as well. For some
problems, f(x) may be an easier function to learn than F(x), and this ease of
learning is worth the trade-off in computation time. In other words, while F(x) is
learnable, it may be extremely difficult for backpropagation or other methods to
tune the parameters of the network in order to duplicate the function. On the
other hand, f(x) may be a simpler function to emulate.

These  recurrent  networks  raise  serious  questions  of  their  own.  For  example,  it  is 
not  guaranteed  that  the  outputs  of  a  particular  recurrent  network  will  become 
stable and cease changing. However, these problems are beyond the scope of
this thesis, so we will concern ourselves only with the case where the output
of the network reaches a fixed point.
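
The idea of computing F by iterating f until the output stops changing can be sketched for a scalar function as follows. The function name, the tolerance tol and the step limit max_steps are assumptions of this example; the step limit is there because, as noted above, a fixed point is not guaranteed.

```python
def iterate_to_fixed_point(f, x, tol=1e-9, max_steps=1000):
    """Apply f repeatedly until successive values differ by less than tol.

    Returns (value, steps). Raises RuntimeError if the values have not
    settled within max_steps applications of f.
    """
    for step in range(1, max_steps + 1):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt, step
        x = nxt
    raise RuntimeError("no fixed point reached")
```

For instance, iterating f(x) = 0.5x + 1 from any starting value settles at 2, whereas iterating f(x) = x + 1 never settles and raises the error.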


5.2  Segmentation 

Segmentation  is  the  problem  of  taking  a  vector  or  matrix  of  values  and  returning 
a  vector  or  matrix  of  equal  size  that  groups  similar  values.  In  segmenting  an 
image,  for  example,  an  algorithm  would  take  as  input  a  matrix  of  pixel  values  and 
output  a  transformed  image  where  each  pixel  in  an  object  has  the  same  value.  It 
is assumed that adjacent objects in the original image are colored differently
enough that we can distinguish between them. If this is not the case, the
algorithm  may  blend  adjacent  objects  and  produce  an  output  image  that  looks 
like  a  single  large  object. 

In this chapter, we will explore two methods of segmentation. One very simple
approach is to decide the range of pixel values that are possible and to drive all
values below some threshold to the minimum value while driving all values above
that threshold to the maximum value (see figure 5.2). This technique turns a
“color” image into a black and white image. Determining the periphery or edges
of objects is then rather straightforward: simply find the points where “black”
borders “white.”

Figure 5.2: Each value below 0.5 along the solid line is driven to zero while
each value above 0.5 is driven to one, yielding the segmented dashed line
with one-dimensional “objects” between points 1 and 3, points 3 and 7 and
points 7 and 8.
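
The black-and-white approach and the edge-finding step can be sketched directly for a one-dimensional signal. The function names and default arguments are assumptions of this example; values equal to the threshold are sent to the maximum here.

```python
def threshold_segment(values, threshold=0.5, low=0.0, high=1.0):
    """Drive values below the threshold to low and all others to high."""
    return [low if v < threshold else high for v in values]


def edges(segmented):
    """Indices i at which segmented[i] and segmented[i + 1] differ."""
    return [i for i in range(len(segmented) - 1)
            if segmented[i] != segmented[i + 1]]
```

Applied to [0.1, 0.4, 0.6, 0.9, 0.3], thresholding gives [0.0, 0.0, 1.0, 1.0, 0.0], and the edges fall after indices 1 and 3.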


A  more  complicated  approach  to  this  problem  is  to  continually  average  the  values 
of  pixels  that  are  spatially  close  whenever  their  values  are  sufficiently  “near”  one 
another.  The  averaging  continues  until  it  produces  no  new  changes  (Hurlbert, 
1989).  The  idea  is  to  bring  the  values  of  the  pixels  within  each  object  closer 
together  while  widening  the  gap  between  pixel  values  across  objects  (see  figure 
5.3). 


Both of these segmentation algorithms can be formulated as recurrent problems.
Hurlbert’s algorithm is perfectly suited to this. In fact, it is described as a
recurrent problem. In practice, the network only has to learn to perform an
approximation towards local averaging at each time step. If this approximation is
more or less accurate, the recurrent nature of the network will drive it to the
correct final answer.

Figure 5.3: Using Hurlbert’s algorithm, the segmentation of the solid line
from figure 5.2 yields the same three “objects.” However the pixel values
found for each object are different and more representative of the original
“image.”

Similarly, the black-and-white algorithm can be successfully implemented so long
as pixel values less than the threshold always move towards the minimum and
pixel values above the threshold always move towards the maximum on each
time step. Given enough recurrent iterations, the network should arrive at the
correct answer.
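
A hand-written stand-in for such a recurrent step shows why a partial move on each time step is enough. The fixed rate of 0.5 and the function names are assumptions of this sketch and are not taken from the trained networks.

```python
def step(v, threshold=0.5, rate=0.5):
    """Move v part of the way toward 0 (below threshold) or 1 (otherwise)."""
    target = 0.0 if v < threshold else 1.0
    return v + rate * (target - v)


def run(v, steps):
    """Feed the output back in as input `steps` times, recording each state."""
    trajectory = [v]
    for _ in range(steps):
        v = step(v)
        trajectory.append(v)
    return trajectory
```

Starting from 0.4, four iterations give approximately 0.2, 0.1, 0.05 and 0.025; starting from 0.8 the values climb toward one in the same way.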


5.3  Practical  Issues  in  Learning  Recurrent  Segmentation 

The recurrent network poses an interesting problem of prediction and control.
The get_next_states() function simply returns the prediction of the network on the
current state. So while in some sense the network is producing a control signal,
it is more accurate to say that it is creating its own states as it goes along.
Because of this, the state space is infinite. In addition, the sequence of states
generated by the network can be infinitely long. As we have noted before, both
of these possibilities can present difficulties for the TD(λ) algorithm.

The iterative approach of recurrent networks, particularly in the case of
segmentation, would suggest that each successive state should be very similar.
This would indicate that a small value for λ, perhaps even zero, would be most
appropriate. On the other hand, this assumption may be invalid. The task is not
described well by an absorbing Markov process and so may prove difficult for
values of λ close to zero, as seen in section 3.2.7.


5.4  Experiments  with  Simple  Recurrent  Segmentation 

The performance of the TD(λ) algorithm was tested with the one-dimensional
versions of both segmentation algorithms. For both problems, experiments were
conducted for values of λ beginning with zero and incremented by 0.1 until λ
reached a value of one.

5.4.1  Black-and-white  Segmentation 

The  networks  trained  on  the  black-and-white  segmentation  problem  were  GRBF 
networks  containing  sixteen  centers.  Since  the  function  for  learning  this  task 
need  only  work  on  one  pixel  at  a  time,  the  input  to  this  network  was  simply  a 
scalar.  The  output  was  redirected  into  the  network  four  times.  All  input  values 
ranged  between  zero  and  one  with  a  threshold  value  of  0.5.  All  values  less  than 
the  threshold  were  associated  with  zero  and  all  values  greater  than  or  equal  to 
the threshold were associated with one.

5.4.2 Hurlbert’s Segmentation

The networks trained on the Hurlbert segmentation problem were GRBF
networks containing twenty-eight centers. The input to this network was an
eight-dimensional vector. The output was redirected into the network eight times.

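The correct output was determined by repeatedly averaging a pixel with its
neighbors using a simple function:

$$x_i = \begin{cases}
\dfrac{x_{i-1} + x_i + x_{i+1}}{3}, & |x_i - x_{i-1}| \le \sigma \text{ and } |x_i - x_{i+1}| \le \sigma \\
\dfrac{x_{i-1} + x_i}{2}, & |x_i - x_{i-1}| \le \sigma \text{ and } |x_i - x_{i+1}| > \sigma \\
\dfrac{x_i + x_{i+1}}{2}, & |x_i - x_{i-1}| > \sigma \text{ and } |x_i - x_{i+1}| \le \sigma \\
x_i, & |x_i - x_{i-1}| > \sigma \text{ and } |x_i - x_{i+1}| > \sigma
\end{cases} \qquad (5.2)$$

with 1 ≤ i ≤ 8. For these experiments, σ was set to 0.2. Pixel values just off the
edges of the input vector (i = 0 and i = 9) were assumed to be equal to ∞ in order
to deal correctly with the boundary conditions. Equation (5.2) was applied in a
recurrent manner eight times, after which most outputs ceased changing in any
significant way.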
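
A sketch of one such averaging pass is given below. It assumes that each pixel is replaced by the mean of itself and whichever immediate neighbors lie within σ of it, with infinite values padded on both ends so that off-edge neighbors never qualify; the exact averaging weights are an assumption of this example, as are the names average_step and segment.

```python
import math


def average_step(x, sigma=0.2):
    """One pass of thresholded neighbor averaging over a 1-D signal."""
    padded = [math.inf] + list(x) + [math.inf]  # off-edge values never match
    out = []
    for i in range(1, len(padded) - 1):
        group = [padded[i]]
        for j in (i - 1, i + 1):
            if abs(padded[i] - padded[j]) <= sigma:
                group.append(padded[j])
        out.append(sum(group) / len(group))
    return out


def segment(x, sigma=0.2, passes=8):
    """Apply the averaging pass repeatedly, as in a recurrent network."""
    for _ in range(passes):
        x = average_step(x, sigma)
    return x
```

On an input such as [0.1, 0.2, 0.15, 0.8, 0.9, 0.85, 0.1, 0.1], values within each run of similar pixels are pulled together while the gaps between runs stay larger than σ, so the three runs are never merged.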


5.5  Segmentation  Results 

Graphs of the root squared error for each experiment can be seen in figure 5.4.
For both kinds of problems, the networks with lower values for λ exhibited the
worst performance. In fact, once λ fell below a certain value, the root squared
error remained more or less constant. This is because each of these networks
learned approximately the same function. For all input values, the network
produced the same constant value at each stage of the iteration. The value of
this constant depended only on the distribution of outputs in the training set.

[Figure 5.4 panels: (a) Root squared error for Black-and-White Segmentation; (b) Root squared error for Hurlbert Segmentation]

Figure 5.4: In a) each network was trained on ten values between zero and one such that half of the
correct outputs were equal to zero and the other half equal to one. Networks trained with values for λ
< 0.6 learned a constant function, driving all inputs to the value 0.5. In b) each network trained with
values for λ < 0.9 also learned a constant function, driving all inputs to the same output vector.

In the case of the black-and-white segmentation problem, the constant function
learned by these networks was simply the fraction of times that the value 1.0
occurred in the output of the training set (i.e. the average of the outputs). For a
training set with half the outputs equal to zero and the other half equal to one, the
function learned was f(x) = 0.5, 0 ≤ x ≤ 1; for a training set with zero appearing
only 30% of the time, the function learned was f(x) = 0.7, 0 ≤ x ≤ 1; and so on.

Similarly, the networks trained on the Hurlbert segmentation task learned a
constant function that depended on the distribution of the outputs of the training
set. In general, the constant function learned was simply the average of the
outputs presented to the network.
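
One way to see why the average appears, assuming a squared-error criterion, is that among all constant predictions the mean of the targets gives the smallest summed squared error. The helper names below are hypothetical and the check is purely illustrative.

```python
def best_constant(targets):
    """The constant c minimizing sum((t - c) ** 2) is the mean of the targets."""
    return sum(targets) / len(targets)


def squared_error(c, targets):
    """Summed squared error of predicting the constant c for every target."""
    return sum((t - c) ** 2 for t in targets)
```

For a training set in which zero appears three times and one appears seven times, best_constant returns 0.7 (up to rounding), and any other constant, such as 0.6 or 0.8, gives a larger squared error.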

To  understand  this  result,  assume  that  our  training  data  for  the  black-and-white 
segmentation  problem  consists  of  two  values,  0.4  and  0.8.  Let  us  posit  these  two 
sequences  seen  by  the  network  early  in  the  learning  procedure: 

0.4:  0.2  0.5  0.8  0.7  0.9 

0.8:  0.3  0.9  0.6  0.4  0.2 

Naturally,  we  want  our  training  values,  0.4  and  0.8  to  be  associated  with  zero  and 
one,  respectively.  Note,  however,  that  0.8  also  appears  in  the  first  sequence  and 
0.4  appears  in  the  second.  This  means  that  zero  will  be  associated  with  both  0.4 
and  0.8.  Similarly,  one  will  be  associated  with  0.4  as  well  as  0.8. 

Since the network sees that one and zero are associated with 0.4 and 0.8
equally, a maximum-likelihood estimator naturally associates these values with
their average, 0.5. Similarly, other values in the sequences above will be
associated with 0.5 as well. In other words, for values of λ approaching zero, the
network does exactly what we would expect it to do. A similar problem occurs
with the Hurlbert segmentation algorithm. So the question becomes: how can
we overcome this so that the network will learn the function that we want?

The  answer  lies  in  the  non-Markovian  nature  of  the  recurrent  task.  There  is  no 
way  to  distinguish  between  the  0.4  in  the  beginning  of  the  first  sequence  and  the 
0.4  that  occurs  in  the  middle  of  the  second  sequence;  however,  in  this  task  they 
are distinctly different entities. The problem is that our “state” information is too
impoverished.
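
The aliasing can be made concrete with a small tabular stand-in for the network, which simply averages the targets of every sequence in which a given value appears. This is a simplification made for illustration (a real network also generalizes between nearby values), and the function name aliased_targets is hypothetical.

```python
from collections import defaultdict


def aliased_targets(sequences):
    """Map each observed value to the mean of the targets it appears with.

    sequences: list of (states, target) pairs, where states holds the scalar
               values seen in one sequence and target is the desired final
               output for that sequence.
    """
    seen = defaultdict(list)
    for states, target in sequences:
        for s in states:
            seen[s].append(target)
    return {s: sum(t) / len(t) for s, t in seen.items()}
```

Using the two sequences above, with targets zero and one respectively, both 0.4 and 0.8 end up associated with 0.5, because each appears once in a sequence that should end at zero and once in a sequence that should end at one.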


This  problem  occurs  even  in  the  weather  prediction  problem  discussed  in  chapter 
two.  There  we  were  using  weather  data  to  predict  whether  it  would  rain  on  a 
particular  Monday.  Let  us  suppose  that  in  one  sequence  we  have  a  rainy  day  on 
the  previous  Tuesday  and  in  another  sequence  we  have  a  rainy  day  on  the 
previous  Sunday.  If  the  Monday  in  the  first  sequence  is  sunny  and  the  Monday  in 
the second is rainy, we will associate a 50% probability of sunny Mondays with
rainy  days.  In  actuality,  a  rainy  day  six  days  before  the  day  we  wish  to  predict 
isn’t  as  useful  a  predictor  as  a  rainy  day  one  day  before  the  day  we  wish  to 
predict.  Unfortunately,  we  have  not  distinguished  Tuesdays  from  Sundays. 

One way to address this issue is to find some way to “tag” the states so as to
distinguish initial values from values that occur later in a sequence. This
possibility is explored in the next sections.


5.6 Experiments with “Tagged” Recurrent Segmentation

As before, the performance of the TD(λ) algorithm was tested with the
one-dimensional versions of both segmentation algorithms. Again, experiments
were conducted for values of λ beginning with zero and incremented by 0.1 until
λ reached a value of one.

5.6.1  Black-and-White  Segmentation 

In this experiment, the scalar input to the network was supplemented by an
additional component. The two components of the initial vector were both set to
the same scalar value. No matter what the two-dimensional output from the
network on a given input, the second component was always transformed into
the initial scalar value before being redirected into the network as input. In this
way, initial states were distinguished from intermediate states.

Figure 5.5: The root squared error for eleven values of λ averaged over several training
trials on the black-and-white segmentation task. As training continued, the improvement
rate of the supervised network slowed.

As  before,  the  output  of  the  network  was  redirected  into  the  network  four  times. 
All  component  input  values  ranged  between  zero  and  one  with  a  threshold  value 
of  0.5.  For  the  first  component,  all  values  less  than  the  threshold  were 
associated  eventually  with  zero  and  all  values  greater  than  or  equal  to  the 
threshold were associated with one. The second component of the final output
vector presented to the network was always the initial value given to the network.
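
The tagging scheme for the scalar task can be sketched as a loop that overwrites the second component with the initial value before every feedback step. Here net is a hypothetical callable standing in for the trained network, and the function name run_tagged is an assumption of this example.

```python
def run_tagged(net, x0, steps=4):
    """Iterate a two-input, two-output network with the tag pinned to x0.

    net: hypothetical callable taking (value, tag) and returning a pair.
    Returns the list of (value, tag) states presented to the network,
    followed by the final state.
    """
    state = (x0, x0)  # both components start at the same scalar value
    trajectory = [state]
    for _ in range(steps):
        value, _ignored = net(*state)
        state = (value, x0)  # overwrite the second component with the tag
        trajectory.append(state)
    return trajectory
```

Because the tag never changes, a value of 0.4 that starts a sequence, seen as (0.4, 0.4), can be told apart from a 0.4 produced midway through a sequence that began at 0.8, seen as (0.4, 0.8).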

5.6.2  Hurlbert’s  Segmentation 

The input to this network was doubled from an eight-dimensional vector into a
sixteen-dimensional vector and the output from the network transformed in a way
similar to that done with the black-and-white segmentation problem. As before,
the output was redirected into the network eight times.

Figure 5.6: A typical series of sequences generated by the network. This is
from one of the networks trained with λ = 0.2 after 1000 epochs.


5.7  Segmentation  Results 

5.7.1  Black-and-White  Segmentation  Results 

Graphs of the root squared error can be seen in figure 5.5. Unlike before, the
networks with values of λ near zero did not learn a constant function. In fact, the
networks trained with values of λ ≠ 1 became increasingly better with practice,
outperforming the traditional “supervised” learning procedure.

Figure 5.6 shows the sequences generated by one of the networks. Typically,
values were moved towards the correct final value in fairly uniform increments.
Once at the desired value, subsequent outputs remained at or near this value.

Figure 5.7: The root squared error for eleven values of λ averaged over several training
trials on the Hurlbert segmentation task. The supervised network performed best initially;
however, the other networks improved with repeated practice.


Finding a way to “tag” the states seemed to restore the necessary Markovian
property to the task. At the very least it allowed the network to distinguish between
initial states and intermediate ones. Presumably, other methods of tagging could be
used and produce the same effect, at least for these simple sorts of tasks.

It is worth noting that the errors discussed in this section take into account only the error generated on the one component of interest; however, the networks tended to learn an identity function for the second component, so including this component did not add significantly to the overall error.

5.7.2  Hurlbert’s  Segmentation  Results 

Graphs of the root squared error can be seen in figure 5.7. All of the networks avoided the trap of finding a constant function. As with the black-and-white segmentation task, values were moved towards the correct final value with each step.

Figure 5.8: The dashed line represents the application of the Hurlbert algorithm on the solid line.

Figure 5.9: The application of one of the neural networks trained with λ = 0.5 to the same line as in figure 5.8.


This segmentation task is much more difficult than the previous task. Even though the networks performed well on the training data, all of the networks generalized poorly and often were unable to discover a useful function when trained on very large datasets, indicating that perhaps an eight-dimensional version of this task is too difficult for a network. A more reasonable version of the task might use only three dimensions, the actual size of the Hurlbert “window,” and combine copies of the network to perform segmentation on larger images. An approach similar to this has been explored in Jones (1992).

Figure 5.10: The application of the Hurlbert algorithm on another line.

Figure 5.11: The application of the network from figure 5.9 to the same line as in figure 5.10. Although the network does not generate the same values as the Hurlbert algorithm, it does segment the line into the same two pieces.


Despite the problems encountered by the networks with this particular task, the results from this and the simpler black-and-white segmentation problem are still useful. At first glance, TD(λ) networks would seem ill-equipped to deal with recurrent problems of the type described in this chapter. After all, there is usually no Markovian process underlying these tasks, and the “control” signal generated by the networks is not really constrained to generate states that are meaningful in the context of the problem. Indeed, these very difficulties arise in the original formulation of the segmentation problems.

However, with an appropriate mechanism for “tagging” the states, the networks trained with λ ≠ 1 may not only perform well, but may sometimes perform better on training sets and generalize better than networks trained with the traditional supervised learning algorithm. If this is indeed the case, the increase in accuracy and ability to generalize may well be worth the increased number of dimensions and the corresponding possible increase in learning time (in this case, there was no large increase in learning time).

In particular, the training time for the tagged black-and-white task did not increase significantly, and the networks trained with λ ≠ 1 performed better than the network trained in a supervised fashion on both the training set and a broader test set. On the Hurlbert segmentation task, the TD networks took longer to perform as well as the supervised network; however, after some time, some of the TD networks appeared to be slowly outperforming it. This proves nothing, but does provide some hope that these networks can often outperform their supervised counterparts.

Before leaving this problem, it is worth noting that both of these problems, particularly the black-and-white segmentation task, are defined so that the most important distinctions are those between initial and intermediate states. Where an intermediate state appears relative to other intermediate states is not very important. By contrast, in the weather prediction problem, the distance between a state and the final prediction is much more relevant.

For tasks where the first state is most important, it may be useful to find some way to change the depth-first nature of the TD(λ) procedure into a breadth-first search as described in section 3.1.1. For tasks like the weather prediction problem, this would not seem as useful.
Chapter 6


6  Conclusion 


There  has  been  some  theoretical  study  of  using  temporal  difference  algorithms  to 
address  issues  of  prediction,  outlining  some  distinct  advantages  of  these 
methods  over  more  traditional  supervised  learning  paradigms.  Nevertheless, 
there  has  been  little  evidence,  either  theoretical  or  empirical,  to  outline  the  limits 
of  these  methods  in  more  complex  real-world  domains  using  multilayer  networks. 

Even as researchers such as Sutton (1988) have shown that many problems that have been attacked by traditional supervised learning algorithms are better understood as prediction tasks, we have sought to show that many of these problems are even more complicated, involving not only making accurate predictions about the environment, but also using those intermediate predictions to actually control the environment.

We have attempted to develop a general formalism for looking at these problems that is robust enough to describe tasks involving prediction and control, whether complete or partial, as well as tasks that require only prediction. We have used this formalism with several case studies to explore several practical issues that arise when using the TD(λ) algorithm.

In the realm of games, training through self-play seems to be a powerful tool for learning robust evaluation functions. This tool seems to provide the best opportunity for a network to develop the skills necessary to play well. By contrast, networks that play against opponents are much less reliable and seem unable to move towards a stable evaluation function, either discovering local minima or oscillating between suboptimal solutions.

Further, the formalism described in this thesis highlights how well suited the training structure of the TD(λ) procedure is to these kinds of game-playing tasks. Indeed, one of this procedure’s strengths is the way in which it deals with these sorts of naturally sequential problems.

Beyond this kind of domain, TD(λ) algorithms seem capable of dealing with tasks that appear ill-defined in a Markovian sense. Most of the TD networks performed as well as or better than their supervised counterparts. Furthermore, repeated presentations continued to improve the performance of these networks even when the improvement of the supervised networks slowed down.

Although there are still unanswered questions, these results suggest various ways to best take advantage of TD(λ) networks on various kinds of tasks. If nothing else, these results suggest that TD(λ) can be used in complex domains without completely obscuring our ability to analyze the algorithm and its limitations. We propose to use the formalism developed in this thesis to continue developing these case studies in order to better understand the practical questions that still remain unanswered about the power and usability of TD(λ).
Appendix A


A: TD(λ) and GRBF networks


In  section  2.2,  we  introduced  the  temporal  difference  update  rule  proposed  by 
Sutton  (1988): 


$$\Delta w_t = \alpha \, (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \, \nabla_w P_k, \tag{2.5}$$

where $w$ represents a vector of the updatable parameters of the network. This update rule is generally applicable; however, it uses a notation that is usually associated with perceptron-like feedforward networks (see section 1.1). With these kinds of networks, not only is the same function usually associated with each computational unit, but each component of $w$ serves the same purpose. Each parameter is used as a transforming agent between two computational units or “neurons,” weighting the output of one of these units before passing it to the other. In particular, if $x_j$ denotes the output of the $j$-th unit of the network and $w_{ij}$ denotes the weight on the connection from unit $i$ to unit $j$ (where $w_{ij}$ is allowed to be zero), the output of unit $j$ can be expressed as:

$$x_j = f\Big(\sum_i w_{ij} \, x_i\Big). \tag{1.1}$$
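As a concrete illustration of equation (2.5), the following minimal sketch applies the rule to a single linear unit, that is, equation (1.1) with $f$ taken to be the identity, so that the gradient of the prediction with respect to the weights is simply the input vector. The function name `td_lambda_sequence`, the use of the final outcome `z` as the last prediction, and the choice to accumulate the updates over a whole sequence before applying them are assumptions made for this example, not details taken from the case studies.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_lambda_sequence(w, xs, z, alpha, lam):
    """Accumulate the updates of equation (2.5) over one sequence.

    Assumes a linear predictor P_t = w . x_t, so grad_w P_t = x_t.
    xs holds the inputs x_1..x_m; z is the final outcome, used as P_{m+1}.
    Returns the weights after the accumulated update is applied.
    """
    n = len(w)
    trace = [0.0] * n       # running sum over k of lam^(t-k) * grad_w P_k
    delta_w = [0.0] * n
    preds = [dot(w, x) for x in xs] + [z]
    for t, x in enumerate(xs):
        # The sum in (2.5) is built incrementally: e_t = lam * e_{t-1} + x_t.
        trace = [lam * e + xi for e, xi in zip(trace, x)]
        error = preds[t + 1] - preds[t]          # P_{t+1} - P_t
        delta_w = [d + alpha * error * e for d, e in zip(delta_w, trace)]
    return [wi + d for wi, d in zip(w, delta_w)]


w_td1 = td_lambda_sequence([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 1.0, 0.5, 1.0)
w_td0 = td_lambda_sequence([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 1.0, 0.5, 0.0)
```

With `lam = 1.0` the accumulated update coincides with the supervised (Widrow-Hoff) update, which credits every input in the sequence with the final error; with `lam = 0.0` only the most recent input is credited at each step.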


For the case studies explored in this thesis, however, we use a Gaussian GRBF network as described by Poggio and Girosi (1990):

$$P(x) = \sum_i c_i \, e^{-\sigma_i \lVert x - t_i \rVert^2}. \tag{1.5}$$


With this kind of network, the purpose of each kind of adjustable parameter is different. It is useful to keep these differences in mind and explore the way in which equation (2.5) must be applied to update these parameters. For equation (1.5), we divide the update rule into three rules, where each rule reflects the details of one type of parameter: the coefficients, $c_i$; the centers, $t_i$; and the $\sigma_i$ parameters, which are particular to the Gaussian radial basis function and define the extent of each Gaussian. Since the important difference between the update rules is the form of the gradient, we will focus on how it changes for each type of parameter.


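For the $i$-th coefficient, $c_i$, the gradient is:

$$\nabla_{c_i} P = e^{-\sigma_i \lVert x - t_i \rVert^2}. \tag{A.1}$$

Notice that each coefficient can be a vector, allowing the output of the network to be a vector of the same dimensions. Since $\nabla_{c_i} P$ is a scalar, substituting the gradient into equation (2.5) yields a vector of the appropriate size, the dimension of the output of the network:

$$\Delta c_i = \alpha \, (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \, e^{-\sigma_i \lVert x_k - t_i \rVert^2}, \tag{A.2}$$

where $x_k$ is the input presented at step $k$.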


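For the $i$-th center, $t_i$, the gradient is:

$$\nabla_{t_i} P = 2 \, \sigma_i \, c_i \, e^{-\sigma_i \lVert x - t_i \rVert^2} (x - t_i)^{T}. \tag{A.3}$$

This gradient is a matrix (with one row per dimension of the output of the network and one column per dimension of the input to the network). Substituting in equation (2.5) yields a vector of the appropriate size, the dimension of the input to the network:

$$\Delta t_i = \alpha \, (P_{t+1} - P_t)^{T} \sum_{k=1}^{t} \lambda^{t-k} \, 2 \, \sigma_i \, c_i \, e^{-\sigma_i \lVert x_k - t_i \rVert^2} (x_k - t_i)^{T}. \tag{A.4}$$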


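For the $i$-th $\sigma$, $\sigma_i$, the gradient is:

$$\nabla_{\sigma_i} P = -c_i \, e^{-\sigma_i \lVert x - t_i \rVert^2} \lVert x - t_i \rVert^2. \tag{A.5}$$

This gradient is a vector in the dimension of the output. Substituting in equation (2.5) yields a scalar:

$$\Delta \sigma_i = -\alpha \, (P_{t+1} - P_t)^{T} c_i \sum_{k=1}^{t} \lambda^{t-k} \, e^{-\sigma_i \lVert x_k - t_i \rVert^2} \lVert x_k - t_i \rVert^2. \tag{A.6}$$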
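The three gradients can be checked numerically. The sketch below assumes a Gaussian network of the form $P(x) = \sum_i c_i \, e^{-\sigma_i \lVert x - t_i \rVert^2}$ with scalar coefficients (a simplification; the text allows vector-valued coefficients), and compares the analytic gradients with central finite differences. All function names are hypothetical and chosen only for this example.

```python
import math


def sq_dist(x, t):
    return sum((a - b) ** 2 for a, b in zip(x, t))


def grbf(x, c, centers, sigma):
    """P(x) = sum_i c_i * exp(-sigma_i * ||x - t_i||^2), scalar c_i (assumed form)."""
    return sum(ci * math.exp(-si * sq_dist(x, ti))
               for ci, ti, si in zip(c, centers, sigma))


def grbf_gradients(x, c, centers, sigma):
    """Analytic gradients of P with respect to each c_i, t_i and sigma_i."""
    grad_c, grad_t, grad_sigma = [], [], []
    for ci, ti, si in zip(c, centers, sigma):
        d2 = sq_dist(x, ti)
        g = math.exp(-si * d2)
        grad_c.append(g)                                   # dP/dc_i
        grad_t.append([2.0 * si * ci * g * (a - b)         # dP/dt_i
                       for a, b in zip(x, ti)])
        grad_sigma.append(-ci * d2 * g)                    # dP/dsigma_i
    return grad_c, grad_t, grad_sigma


def numeric_derivative(f, value, eps=1e-6):
    """Central finite difference of a scalar function of one variable."""
    return (f(value + eps) - f(value - eps)) / (2.0 * eps)


x = [0.5, -1.0]
c = [1.5, -0.7]
centers = [[0.0, 0.0], [1.0, -1.0]]
sigma = [0.8, 1.3]
grad_c, grad_t, grad_sigma = grbf_gradients(x, c, centers, sigma)

# Perturb one parameter at a time and compare with the analytic value.
num_c0 = numeric_derivative(lambda v: grbf(x, [v, c[1]], centers, sigma), c[0])
num_s1 = numeric_derivative(lambda v: grbf(x, c, centers, [sigma[0], v]), sigma[1])
num_t00 = numeric_derivative(
    lambda v: grbf(x, c, [[v, 0.0], centers[1]], sigma), centers[0][0])
```

Each numerical estimate should agree with the corresponding analytic entry (`grad_c[0]`, `grad_sigma[1]`, `grad_t[0][0]`) to within the finite-difference error.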

The intuitive discussion in chapter two, and similar discussions in Sutton (1988) and Dayan (1991), apply to each of these update rules. Formally, these update rules differ from Sutton's examples in that these networks are nonlinear and the proofs that show convergence for TD(λ) do not apply directly; however, this is also true of the nonlinear multi-layer perceptrons used by Tesauro (1992).